Meta
posted by takyon on Saturday June 04 2016, @03:30PM
from the just-a-thought dept.

Not infrequently, we receive a story submission to SoylentNews that consists of a single URL. Or, a story that contains but a single sentence or paragraph along with the URL. During weekends and holidays, the story submission queue tends to run dry. We have an IRC channel (#rss-bot) that gathers RSS links from around the web. Hmm.

It would be really handy if there were an automated way to "scrape" the contents of that page. In some cases, a simple redirect of the output of a text-based browser like Lynx would do the trick. Unfortunately, all too many sites subscribe to the idea that a web page needs to pull in JavaScript and libraries from a myriad of other sites; without them, all you get is a blank page.
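
For the easy cases, where a plain Lynx dump is enough, the redirect really is tiny; a sketch wrapped in Perl (the URL is just a placeholder):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Placeholder URL; -dump renders the page as plain text, -nolist drops the footnoted link list.
    my $url  = 'https://example.com/some-article';
    my $text = qx(lynx -dump -nolist "$url");
    print $text;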

There must be a way to do it: search engines like Google and Bing must extract the page text in order to index it. It would be best to have a general-purpose solution; a custom template for each site is time-consuming to create and maintain (think of what happens when a site changes its layout). Our site is powered by Perl, so that would be the obvious preference.

So, fellow Soylentils, what tools and/or techniques have you used? What has worked for you?

Maybe I'm approaching this the wrong way? When all you have is a hammer... what am I missing here? Is there another approach?


Original Submission

 
  • (Score: 3, Interesting) by canopic jug (3949) Subscriber Badge on Saturday June 04 2016, @04:18PM (#355130) Journal

    You probably know about LWP to fetch the pages. To extract text using Perl, I'd recommend HTML::TreeBuilder or HTML::TreeBuilder::XPath. I've recently used both, together with wget, to extract thousands of pages from a defunct CMS, restructure the output into static HTML designed for human maintenance, and convert some of the more obnoxious attempts at formatting to CSS and many chunks to SSI. They work easily for screen scraping and are easy to update when the target changes its layout. You might want to pipe the fetched pages through Tidy first to make sure they are valid HTML, or at least well-formed XHTML.
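
    A rough sketch of that combination, assuming a hypothetical page whose article text sits in a div with id "content" (both the URL and the XPath are placeholders to adjust per site):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::UserAgent;
        use HTML::TreeBuilder::XPath;

        my $url = 'https://example.com/some-article';    # placeholder target

        my $ua   = LWP::UserAgent->new( agent => 'SN-scraper/0.1', timeout => 30 );
        my $resp = $ua->get($url);
        die 'Fetch failed: ' . $resp->status_line . "\n" unless $resp->is_success;

        # Build a parse tree from the fetched HTML and pull the text of every
        # paragraph inside the (assumed) main content div.
        my $tree  = HTML::TreeBuilder::XPath->new_from_content( $resp->decoded_content );
        my @paras = $tree->findnodes('//div[@id="content"]//p');
        print $_->as_text, "\n\n" for @paras;

        $tree->delete;    # free the parse tree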

    There is a corresponding XML module, XML::TreeBuilder, that functions in much the same way if you need to process XML feeds.
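
    For a feed, something along these lines should list each item's title and link (untested sketch; 'feed.xml' is a placeholder for a saved copy of the feed):

        use strict;
        use warnings;
        use XML::TreeBuilder;

        my $feed = XML::TreeBuilder->new;
        $feed->parse_file('feed.xml');    # placeholder filename

        # Standard RSS layout: each <item> carries a <title> and a <link>.
        for my $item ( $feed->find_by_tag_name('item') ) {
            my $title = $item->find_by_tag_name('title');
            my $link  = $item->find_by_tag_name('link');
            next unless $title && $link;
            printf "%s\n  %s\n", $title->as_text, $link->as_text;
        }

        $feed->delete;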

    --
    Money is not free speech. Elections should not be auctions.

  • (Score: 2) by fliptop (1666) on Saturday June 04 2016, @07:30PM (#355221) Journal

    To extract text using Perl, I'd recommend HTML::TreeBuilder or HTML::TreeBuilder::XPath

    HTML::SimpleParse works pretty well too. Sites tend to keep their content in a uniquely named div, so if you walk the $p->tree and find the div(s), you can easily retrieve what you're looking for.
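
    The same "find the uniquely named div" idea, sketched here with the parent's HTML::TreeBuilder rather than HTML::SimpleParse (the id "article-body" is made up; every site names its content wrapper differently):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use HTML::TreeBuilder;

        # Read an already-fetched page on STDIN and print the text of its content div.
        my $html = do { local $/; <STDIN> };
        my $tree = HTML::TreeBuilder->new_from_content($html);

        my $div = $tree->look_down( _tag => 'div', id => 'article-body' );
        print $div ? $div->as_text . "\n" : "content div not found\n";

        $tree->delete;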

    --
    Our Constitution was made only for a moral and religious people. It is wholly inadequate to the government of any other.
    • (Score: 2) by canopic jug (3949) Subscriber Badge on Saturday June 04 2016, @08:28PM (#355256) Journal

      HTML::SimpleParse works pretty well too. Sites tend to keep their content in a uniquely named div, so if you walk the $p->tree and find the div(s), you can easily retrieve what you're looking for.

      Yes, I looked at HTML::SimpleParse and HTML::TokeParser first, but for the task I had I quickly found I needed something more. With HTML::TreeBuilder you can make fairly complex selections and even replacements.

      But reading all of the summary again, I see that the problem is JavaScript, not so much the HTML parsing. I generally don't waste my time on sites that are so heavy with JavaScript that they won't work with it turned off. Those kinds of sites probably should be encouraged to die. If SN wants to keep those kinds of sites alive, then maybe JavaScript::SpiderMonkey and/or WWW::Mechanize::Firefox could work to parse the garbage pages before processing. However, those scripts and any dependent applications really need to be severely chrooted and locked down.
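
      Roughly, the WWW::Mechanize::Firefox route would look like this (untested; it drives a real Firefox with the MozRepl extension installed, and the URL is a placeholder):

          use strict;
          use warnings;
          use WWW::Mechanize::Firefox;

          # The browser runs the page's JavaScript; we then read back the rendered DOM.
          my $url  = 'https://example.com/js-heavy-article';    # placeholder
          my $mech = WWW::Mechanize::Firefox->new();
          $mech->get($url);

          my $html = $mech->content;    # rendered HTML, ready for HTML::TreeBuilder
          print length($html), " bytes of rendered HTML\n";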

      --
      Money is not free speech. Elections should not be auctions.