Not infrequently, we receive a story submission to SoylentNews that consists of a single URL. Or, a story that contains but a single sentence or paragraph along with the URL. During weekends and holidays, the story submission queue tends to run dry. We have an IRC channel (#rss-bot) that gathers RSS links from around the web. Hmm.
It would be really handy if there were an automated way to "scrape" the contents of that page. In some cases, simply redirecting the output of a text-based browser like Lynx would do the trick. Unfortunately, all too many sites subscribe to the idea that a web page needs to pull in JavaScript and libraries from a myriad of other sites; if those don't load, the page displays as blank.
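For the easy cases, a few lines of Perl can stand in for a Lynx dump. Here's a minimal sketch using the stock CPAN modules LWP::Simple, HTML::TreeBuilder, and HTML::FormatText (the URL is a placeholder); on the JavaScript-dependent sites just described, it returns the same empty shell:

```perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TreeBuilder;
use HTML::FormatText;

my $url  = 'https://example.com/article';        # placeholder URL
my $html = get($url) or die "couldn't fetch $url\n";

# Parse the served HTML and render it to plain text, lynx-style.
my $tree = HTML::TreeBuilder->new_from_content($html);
print HTML::FormatText->new( leftmargin => 0, rightmargin => 78 )
                      ->format($tree);
$tree->delete;
```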
There must be a way to do it: search engines like Google and Bing must extract the page text in order to index it. It would be best to have a general-purpose solution, since custom templates for each site are time-consuming to create and maintain (think of what happens when a site changes its layout). Our site is powered by Perl, so that would be the obvious preference.
So, fellow Soylentils, what tools and/or techniques have you used? What has worked for you?
Maybe I'm approaching this the wrong way? When all you have is a hammer... what am I missing here? Is there another approach?
(Score: 3, Interesting) by Non Sequor on Saturday June 04 2016, @04:46PM
First, the tool you need is a headless browser. These exist, and they give you a console- or API-based way to load a page, including anything accomplished by JavaScript, and then let you get a DOM tree for the page elements. Some headless browsers even go as far as essentially doing all of the layout for graphical display and will let you generate screenshots or query DOM attributes related to display.
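To make that concrete from Perl (since that's the site's preference), here's a minimal sketch. The CPAN module WWW::Mechanize::PhantomJS and the URL are my assumptions, and the module expects a phantomjs binary on your PATH:

```perl
use strict;
use warnings;
use WWW::Mechanize::PhantomJS;   # drives a phantomjs process behind the scenes

my $mech = WWW::Mechanize::PhantomJS->new();
$mech->get('https://example.com/article');   # placeholder URL

# content() returns the serialized DOM *after* the page's JavaScript has run,
# which is exactly what a plain HTTP fetch can't give you.
my $html = $mech->content;
print $html;
```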
Now, once you're comfortable using a headless browser, the next step for what you want to do is to come up with a heuristic for grabbing an appropriate chunk of article text. I'm thinking a starting point might be to get a list of all of the document nodes with inner text and sort it by word count, possibly weighted by some DOM attribute that's indicative of the final rendered size of the text. From here, I think you can get a heuristic to identify the deepest node in the tree that contains the article body text: traverse the elements of the DOM tree using the parent relation starting from these listed elements, identify common parents, and stop when you have identified a node which is parent to x% of the [display-size-weighted] inner text.
Why do we want the deepest one? Because we want to avoid getting text from navigation bars, ads, and comment sections. In general, all of those things should be made up of small chunks of text, so seeding the search from the largest chunks will bias it towards the article section (plus weighting for display size may seal the deal).
Once you've identified that, either convert it all down to text and let the human editor pare it down, or take all of the text up to and including the first paragraph that is greater than n sentences long (for the sake of getting a decent amount of text from articles which start with a cutesy sequence of short paragraphs).
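Here's a rough sketch of that ancestor-walking heuristic in Perl with HTML::TreeBuilder; the 80% threshold, the tag blacklist, and the sub name are all assumptions to tune, and the display-size weighting is left out:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sketch of the heuristic above. The 80% cutoff, the tag blacklist,
# and the sub name are assumptions; display-size weighting is omitted.
sub find_article_node {
    my ( $html, $threshold ) = @_;
    $threshold //= 0.80;    # the "x%" cutoff; tune to taste

    my $tree = HTML::TreeBuilder->new_from_content($html);

    my %count;              # words under each element, keyed by its address
    my $total = 0;          # words in the whole document
    for my $el ( $tree->look_down( sub { 1 } ) ) {    # every element
        next if $el->tag =~ /^(?:script|style|head|title)\z/;
        for my $child ( $el->content_list ) {
            next if ref $child;                       # skip child elements, keep text
            my $words = () = $child =~ /\S+/g;        # word count of this text node
            $total += $words;
            # Credit the text to this element and every ancestor above it;
            # this is the "traverse using the parent relation" step.
            $count{ $_->address } += $words for $el, $el->lineage;
        }
    }
    return unless $total;

    # Among the nodes holding at least $threshold of all the text,
    # keep the deepest one; that skips nav bars, ads, and comments.
    my ($best) =
        sort { $b->depth <=> $a->depth }
        grep { ( $count{ $_->address } // 0 ) >= $threshold * $total }
        $tree->look_down( sub { 1 } );
    return $best;           # an HTML::Element; $best->as_text is the prose
}
```

Feed it the HTML captured by the headless browser and hand $best->as_text to a human editor for the final pare-down.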
Write your congressman. Tell him he sucks.
(Score: 2) by hendrikboom on Saturday June 04 2016, @10:06PM
I would very much like to know more about headless browsers. I didn't know there were any. Care to provide links or other information about them? Or about readily available components from which to build one's own?
-- hendrik
(Score: 2) by isostatic on Saturday June 04 2016, @10:32PM
PhantomJS is one: a full-blown WebKit browser with a JavaScript engine, CSS3 support, etc. etc.
(Score: 3, Informative) by Non Sequor on Saturday June 04 2016, @10:33PM
There's a proliferation of headless browsers right now based around different Javascript engines. Here's a StackOverflow page with a ton of them: http://stackoverflow.com/questions/18539491/headless-browser-and-scraping-solutions. [stackoverflow.com] I've used PhantomJS, primarily as just a JavaScript REPL, but I know it has a lot of features.
I've also used WWW::Mechanize with Perl, and it's easy to use. The only thing is, I think at this point it's a little primitive compared to the other options. Conceptually it's a bit more lynx-like: it fetches pages but doesn't run their JavaScript.
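For reference, a minimal WWW::Mechanize fetch looks like this (a sketch; the URL is a placeholder):

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );   # die on HTTP errors
$mech->get('https://example.com/article');          # placeholder URL
print $mech->content;   # raw HTML as served; no JavaScript is executed
```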
Selenium is probably the most extreme level of engineering invested in a headless browser. It has bindings for multiple languages and works with multiple browsers. I found it harder to get started with.
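From Perl, the usual entry point is the CPAN module Selenium::Remote::Driver. A minimal sketch, assuming a Selenium server already running on its default localhost:4444 and a placeholder URL:

```perl
use strict;
use warnings;
use Selenium::Remote::Driver;

# With no arguments this connects to a Selenium server on localhost:4444.
my $driver = Selenium::Remote::Driver->new;
$driver->get('https://example.com/article');   # placeholder URL
my $source = $driver->get_page_source;         # DOM after JavaScript has run
$driver->quit;
print $source;
```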
Write your congressman. Tell him he sucks.