Not infrequently, we receive a story submission to SoylentNews that consists of a single URL, or one that contains but a single sentence or paragraph along with the URL. During weekends and holidays, the story submission queue tends to run dry. We have an IRC channel (#rss-bot) that gathers RSS links from around the web. Hmm.
It would be really handy if there were an automated way to "scrape" the contents of the linked page. In some cases, simply redirecting the output of a text-based browser like Lynx would do the trick. Unfortunately, all too many sites subscribe to the idea that a web page needs to pull in JavaScript and libraries from a myriad of other sites. A browser that does not run those scripts gets a blank page.
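For pages that do serve their content as plain HTML, the Lynx-style dump can be approximated in a few lines. What follows is only a minimal sketch, in Python with nothing but the standard library rather than the Perl the site runs on; the names TextExtractor and extract_text are made up for illustration. It does not execute JavaScript, so it would still come up empty on the script-heavy pages described above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML page, ignoring script and style contents.

    Hypothetical helper for illustration; it does not run JavaScript.
    """

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a <script> or <style> element
        self._chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style blocks.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the text found in an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Feeding it the HTML of a fetched page returns the text with markup, scripts, and stylesheets stripped out. Navigation menus and other page furniture come along too, so picking out just the article body would need more work.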
There must be a way to do it; search engines like Google and Bing must extract the page text in order to index it. It would be best to have a general-purpose solution, since a custom template for each site is time-consuming to create and maintain (consider what happens when a site changes its layout). Our site is powered by Perl, so that would be the obvious preference.
So, fellow Soylentils, what tools and/or techniques have you used? What has worked for you?
Maybe I'm approaching this the wrong way? When all you have is a hammer... what am I missing here? Is there another approach?
(Score: 2) by hendrikboom on Saturday June 04 2016, @10:06PM
I would very much like to know more about headless browsers. I didn't know there were any. Care to provide links or other information about them? Or about readily available components from which to build one's own?
-- hendrik
(Score: 2) by isostatic on Saturday June 04 2016, @10:32PM
PhantomJS is one: a full-blown WebKit browser with a JavaScript engine, CSS 3 support, etc.
(Score: 3, Informative) by Non Sequor on Saturday June 04 2016, @10:33PM
There's a proliferation of headless browsers right now based around different JavaScript engines. Here's a Stack Overflow page with a ton of them: http://stackoverflow.com/questions/18539491/headless-browser-and-scraping-solutions [stackoverflow.com]. I've used PhantomJS, primarily as just a JavaScript REPL, but I know it has a lot of features.
I've also used WWW::Mechanize with Perl and it's easy to use. The only thing is, I think at this point it's a little primitive compared to the other options. Conceptually, I think it's a bit more lynx-like.
Selenium is probably the most extreme level of engineering invested in driving a browser from code. It has bindings for multiple languages and it works with multiple browsers. I found it harder to get started with.
Write your congressman. Tell him he sucks.