SoylentNews is people

Meta
posted by takyon on Saturday June 04 2016, @03:30PM
from the just-a-thought dept.

Not infrequently, we receive a story submission to SoylentNews that consists of a single URL, or one that contains but a single sentence or paragraph along with the URL. During weekends and holidays, the story submission queue tends to run dry. We have an IRC channel (#rss-bot) that gathers RSS links from around the web. Hmm.

It would be really handy if there were an automated way to "scrape" the contents of the page behind such a URL. In some cases, simply redirecting the output of a text-based browser like Lynx would do the trick. Unfortunately, all too many sites subscribe to the idea that a web page needs to pull in JavaScript and libraries from a myriad of other sites. A browser that fails to do so gets a blank page.

There must be a way to do it; search engines like Google and Bing must extract the page text in order to index it. It would be best to have a general-purpose solution, since a custom template for each site is time-consuming to create and maintain (think of what happens when a site changes its layout). Our site is powered by Perl, so that would be the obvious preference.
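
To make "general-purpose" concrete, here is a minimal sketch of template-free text extraction for the plain-HTML case. It is written in Python rather than Perl, uses only the standard library, and is not anything the site actually runs. The names `TextExtractor` and `extract_text` are made up for illustration. It does not execute JavaScript, so it does nothing for pages that render blank without it.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script-like tags.

    Hypothetical helper for illustration only.
    """

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty text runs, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string, one text run per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Feeding it the body of a fetched page returns the text with markup, scripts, and stylesheets removed. Navigation menus and other page furniture still come through, so this is only a starting point rather than a real article extractor.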

So, fellow Soylentils, what tools and/or techniques have you used? What has worked for you?

Maybe I'm approaching this the wrong way? When all you have is a hammer... what am I missing here? Is there another approach?


Original Submission

 
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 2) by Runaway1956 (2926) on Saturday June 04 2016, @03:44PM (#355116)

    How does Calibre fetch news? http://manual.calibre-ebook.com/news.html I have no idea whether Calibre is on speaking terms with Perl, though.

    --
    “I have become friends with many school shooters” - Tampon Tim Walz
  • (Score: 4, Insightful) by frojack (1554) on Saturday June 04 2016, @04:38PM (#355143)

    I ran Calibre's fetch-news feature for quite a while, but transferring the news to my e-reader was never worth the effort.
    It is rather indiscriminate, designed to scrape an entire site rather than a single article, but the scripts and source are all available.

    But here we are, with the IRC channel tail wagging the website dog, something I warned about when the IRC channel was first discussed.
    If people think so little of a story that a one-liner on IRC is all they can muster, maybe we don't need that story.

    --
    No, you are mistaken. I've always had this sig.