Not infrequently, we receive a story submission to SoylentNews that consists of a single URL, or of but a single sentence or paragraph along with the URL. During weekends and holidays, the story submission queue tends to run dry. We have an IRC channel (#rss-bot) that gathers RSS links from around the web. Hmm.
It would be really handy if there were an automated way to "scrape" the contents of the page behind such a link. In some cases, simply redirecting the output of a text-based browser like Lynx to a file would do the trick. Unfortunately, all too many sites subscribe to the idea that a web page needs to pull in JavaScript and libraries from a myriad of other sites. If those scripts are not loaded and run, the page displays blank.
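For the simple case, where the text is already in the HTML the server sends, something along these lines gets roughly what a text-browser dump would. This is only a sketch in Python (not Perl) using nothing but the standard library; the names `TextExtractor`, `extract_text`, and `fetch` are made up for illustration, and it does not run JavaScript, so the script-heavy sites described above would still come back empty.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of script/style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch(url):
    """Download a page and decode it (hypothetical helper, no error handling)."""
    with urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


if __name__ == "__main__":
    sample = (
        "<html><head><title>Demo</title><script>var x = 1;</script></head>"
        "<body><h1>Headline</h1><p>Body text &amp; more.</p></body></html>"
    )
    print(extract_text(sample))
```

For a live page it would be `extract_text(fetch(url))`. Note that this keeps navigation menus, footers, and other page furniture along with the article text; picking out just the story is a separate problem.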
There must be a way to do it: search engines like Google and Bing must extract the page text in order to index it. It would be best to have a general-purpose solution; a custom template for each site is time-consuming to create and to maintain (think of what happens when a site changes its layout). Our site is powered by Perl, so that would be the obvious preference.
So, fellow Soylentils, what tools and/or techniques have you used? What has worked for you?
Maybe I'm approaching this the wrong way? When all you have is a hammer... what am I missing here? Is there another approach?
(Score: 3, Informative) by Hairyfeet on Sunday June 05 2016, @02:55AM
But there is an easy way to fix the "queue runs dry" issue, which is where I've gotten every article I've posted here... Daily Rotation [dailyrotation.com], which is nothing but science and tech, VERY nerdy and right up this site's alley. As just a couple of examples, just me grabbing the latest headlines... a missing features installer for XP [betanews.com] from the guy that made the MFI for Win 10, FBI says TOR child porn exploit wasn't malware [arstechnica.com] and Japan's battleship island [cnet.com] with some great pictures.
So if the site starts to run dry? Just head over to Rotation and grab the headlines; it's all the nerdy kind of stuff the old site USED to cover.
ACs are never seen so don't bother. Always ready to show SJWs for the racists they are.