Not infrequently, we receive a story submission to SoylentNews that consists of a single URL. Or, a story that contains but a single sentence or paragraph along with the URL. During weekends and holidays, the story submission queue tends to run dry. We have an IRC channel (#rss-bot) that gathers RSS links from around the web. Hmm.
It would be really handy if there were an automated way to "scrape" the contents of that page. In some cases, simply redirecting the output of a text-based browser like Lynx would do the trick. Unfortunately, all too many sites subscribe to the idea that a web page needs to pull in JavaScript and libraries from a myriad of other sites; fail to load those, and all you get is a blank page.
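For sites that still serve their content as static HTML, the text-browser approach can be approximated with nothing but the standard library: fetch the page and strip the markup, skipping script and style blocks. This is only a minimal sketch of that idea (the class and function names are illustrative, not from any particular library), and it shows exactly where the approach breaks down: a JavaScript-driven page returns little more than boilerplate, because the article text is never present in the HTML the server sends.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # nesting depth inside script/style elements
        self.chunks = []       # pieces of visible text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def page_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Point `page_text` at the body returned by an HTTP fetch (e.g. `urllib.request.urlopen`) and a static article comes back as readable text; point it at a page that builds its DOM client-side and the emptiness of the result is the symptom described above.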
There must be a way to do it — search engines like Google and Bing must extract the page text in order to index it. It would be best to have a general-purpose solution; having a custom template for each site is time-consuming to create and maintain (think if the site changes its layout). Our site is powered by Perl, so that would be the obvious preference.
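One general-purpose heuristic, used by readability-style extractors, sidesteps per-site templates entirely: score the page's block-level elements by how much plain text they hold, and take the densest block as the article body. Navigation bars, sidebars, and footers are mostly links and short labels, so they score low. A minimal sketch of that idea, assuming a crude "longest text block wins" score (names are illustrative; a real extractor would also weight link density and merge adjacent blocks):

```python
from html.parser import HTMLParser

# Tags treated as block boundaries for scoring purposes.
BLOCK_TAGS = {"p", "div", "article", "section", "td", "li"}

class DensityScorer(HTMLParser):
    """Split the page into text blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished text blocks
        self._current = []    # text pieces inside the currently open block
        self._in_script = 0   # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._in_script += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._in_script:
            self._in_script -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._in_script and data.strip():
            self._current.append(data.strip())

    def _flush(self):
        if self._current:
            self.blocks.append(" ".join(self._current))
            self._current = []

def main_text(html):
    """Return the densest (here: longest) text block as the likely article body."""
    scorer = DensityScorer()
    scorer.feed(html)
    scorer._flush()
    return max(scorer.blocks, key=len, default="")
```

Because the heuristic looks only at text density, it survives layout changes that would break a hand-written per-site template; the trade-off is that it can misfire on pages where the longest block is a comment thread rather than the article.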
So, fellow Soylentils, what tools and/or techniques have you used? What has worked for you?
Maybe I'm approaching this the wrong way? When all you have is a hammer... what am I missing here? Is there another approach?
(Score: 0) by Anonymous Coward on Saturday June 04 2016, @10:29PM
BlackWidow
by http://SoftByteLabs.com/ [softbytelabs.com]
Published June 11, 2004
This is a multi-function Internet tool. It is an offline browser, a Web site scanner, a site mapping tool, a site ripper, a site mirroring tool, and an FTP scanner. Use it to scan a site and create a complete profile of the site's structure, files, external links and link errors. Then use it to download the web site with its structure and files intact, to use as a site mirror or to be converted by BlackWidow into a locally linked site for off-line browsing and long-term reference. Or use it to scan for and download any selection of files in part of a site or in a group of sites.
BlackWidow will scan HTTP sites, SSL sites and FTP sites. You can access password-protected sites, use threads, pull links from JavaScript and JavaScript files, and resume broken downloads. View, edit and print the structure of a Web site, write your own plug-ins and automatically load your plug-ins. It will also scan Adobe Acrobat files for links.
https://archive.org/details/tucows_193802_BlackWidow [archive.org]