Help SoylentNews — "Scraping" Web Pages

posted by takyon on Saturday June 04 2016, @03:30PM

from the just-a-thought dept.

Not infrequently, we receive a story submission to SoylentNews that consists of a single URL, or one that contains but a single sentence or paragraph along with the URL. During weekends and holidays, the story submission queue tends to run dry. We have an IRC channel (#rss-bot) that gathers RSS links from around the web. Hmm.

It would be really handy if there were an automated way to "scrape" the contents of the linked page. In some cases, a simple redirect of the output of a text-based browser like Lynx would do the trick. Unfortunately, all too many sites subscribe to the idea that a web page needs to pull in JavaScript and libraries from a myriad of other sites. Without them, the page renders blank.

There must be a way to do it: search engines like Google and Bing must extract the page text in order to index it. It would be best to have a general-purpose solution; a custom template for each site is time-consuming to create and to maintain (consider what happens when a site changes its layout). Our site is powered by Perl, so that would be the obvious preference.
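
For pages that do serve their text as plain HTML, the Lynx-style approach can be approximated in Perl itself. What follows is only a minimal sketch under stated assumptions: it uses the core HTTP::Tiny module to fetch the page and a few regular expressions to strip the markup, and html_to_text is a hypothetical helper name, not part of any library. It does not run JavaScript, so the script-dependent sites described above will still come back empty, and regex stripping is much cruder than a real HTML parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Tiny;    # core module since Perl 5.14

# Hypothetical helper (an assumption for illustration): crude HTML-to-text
# conversion with regexes. A real HTML parser handles far more cases.
sub html_to_text {
    my ($html) = @_;

    # Drop script/style blocks and HTML comments entirely.
    $html =~ s{<(script|style)\b[^>]*>.*?</\1\s*>}{}gis;
    $html =~ s{<!--.*?-->}{}gs;

    # Turn <br> and a few block-level closing tags into line breaks.
    $html =~ s{<(?:br|/p|/div|/h[1-6]|/li|/tr|/title)\b[^>]*>}{\n}gi;

    # Remove whatever tags remain.
    $html =~ s{<[^>]+>}{}g;

    # Decode a handful of common entities; &amp; goes last so that
    # text such as "&amp;lt;" is not decoded twice.
    my %entity = (lt => '<', gt => '>', quot => '"', nbsp => ' ');
    $html =~ s/&(lt|gt|quot|nbsp);/$entity{$1}/g;
    $html =~ s/&amp;/&/g;

    # Collapse runs of spaces and tabs, tidy line breaks, trim the ends.
    $html =~ s/[ \t\r]+/ /g;
    $html =~ s/ ?\n ?/\n/g;
    $html =~ s/\n{3,}/\n\n/g;
    $html =~ s/^\s+|\s+$//g;

    return $html;
}

# Fetch and print only when a URL is given on the command line.
if (@ARGV) {
    my $res = HTTP::Tiny->new(timeout => 10)->get($ARGV[0]);
    die "Fetch failed: $res->{status} $res->{reason}\n" unless $res->{success};
    print html_to_text($res->{content}), "\n";
}
```

Saved under a made-up name such as page2text.pl, it could be run as `perl page2text.pl http://example.com/ > page.txt`. HTTPS URLs additionally need IO::Socket::SSL installed, and the fetched content is treated as raw bytes with no character-set handling.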

So, fellow Soylentils, what tools and/or techniques have you used? What has worked for you?

Maybe I'm approaching this the wrong way? When all you have is a hammer... what am I missing here? Is there another approach?

Some Perl resources you might find useful... (Score: 3, Informative)

by number6 (1831) on Saturday June 04 2016, @09:43PM (#355282)

Web::Scraper
http://search.cpan.org/dist/Web-Scraper/ [cpan.org]
Perl module, Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions, by Tatsuhiko Miyagawa.
* A blog article............http://perlgerl.wordpress.com/2011/02/27/219/
* A comment...............Web::Scraper is so effin awesome. You just yank up some LWP from disk,
add some scraper to it and whambo! you have yourself some local data that the web universe
used to call its own. Follow that author on CPAN; he's a genius.

How to fetch urls? I want to build a Perl web scraper like Python's urllib
https://www.reddit.com/r/perl/comments/3oiwqn/how_to_fetch_urls_like_in_pythons_urllib/ [reddit.com]

Web Scraping with Perl & PhantomJS (headless WebKit browser)
article, Feb 2013, by Rob Hammond (Perl programmer)
http://blogs.perl.org/users/robhammond/2013/02/web-scraping-with-perl-phantomjs.html [perl.org]

Perl Proxy Scraper
https://phl4nk.wordpress.com/2015/04/11/perl-proxy-scraper/ [wordpress.com]
* A comment...............Neat! I toyed around with a one-liner that does about the same thing for standard output:
perl -Mojo -E 'say g($_)->body =~ /((?:\d{1,3}\.?){4}:\d{1,4})/g for @ARGV' $url1 $url2 ...
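
The one-liner above relies on Mojolicious (the ojo module). A rough equivalent using only core Perl might look like the sketch below. Swapping HTTP::Tiny in for ojo's g() is a substitution made here, and extract_proxies is a hypothetical helper name. The pattern is also slightly stricter than the original: the dots are required, the port may run to five digits, and each match is printed on its own line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Tiny;    # core module; swapped in for the one-liner's Mojolicious g()

# Hypothetical helper: return every "ip:port" string found in a block of text.
# Call it in list context to get the matches themselves.
sub extract_proxies {
    my ($text) = @_;
    # Four dot-separated groups of 1-3 digits, a colon, then a 1-5 digit port.
    # Like the one-liner, this does not check that each octet is <= 255.
    return $text =~ /\b((?:\d{1,3}\.){3}\d{1,3}:\d{1,5})\b/g;
}

# Fetch each URL given on the command line and print one match per line.
for my $url (@ARGV) {
    my $res = HTTP::Tiny->new(timeout => 10)->get($url);
    unless ($res->{success}) {
        warn "Skipping $url: $res->{status} $res->{reason}\n";
        next;
    }
    print "$_\n" for extract_proxies($res->{content});
}
```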

how-to-develop-a-good-scraper-on-perl - Google Search [google.com]
