Not infrequently, we receive a story submission to SoylentNews that consists of a single URL, or of a single sentence or paragraph along with the URL. During weekends and holidays, the story submission queue tends to run dry. We have an IRC channel (#rss-bot) that gathers RSS links from around the web. Hmm.
It would be really handy if there were an automated way to "scrape" the contents of that page. In some cases, simply redirecting the output of a text-based browser like Lynx would do the trick. Unfortunately, all too many sites subscribe to the idea that a web page needs to pull in JavaScript and libraries from a myriad of other sites. A browser that fails to do so displays a blank page.
There must be a way to do it: search engines like Google and Bing must extract the page text in order to index it. It would be best to have a general-purpose solution; a custom template for each site is time-consuming to create and maintain (think of what happens when a site changes its layout). Our site is powered by Perl, so that would be the obvious preference.
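To make the "general-purpose" idea concrete, here is a minimal sketch of template-free text extraction. It is in Python rather than Perl, purely for illustration, and uses only the standard library. The names VisibleTextExtractor and extract_text are made up for this example. It assumes the HTML has already been fetched and that the text is present in the markup; it does nothing for pages that build their content with JavaScript.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping elements whose content is never displayed."""

    # Hypothetical choice of tags to ignore; adjust to taste.
    SKIPPED_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # one entry per non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace so layout indentation disappears.
            text = " ".join(data.split())
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        "<html><head><title>T</title><script>var x = 1;</script></head>"
        "<body><h1>Headline</h1><p>First   paragraph.</p></body></html>"
    )
    print(extract_text(sample))
```

This keeps navigation menus, footers and other page furniture along with the article text, so it is a starting point rather than a finished scraper. Picking out just the story body would need some further heuristic on top.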
So, fellow Soylentils, what tools and/or techniques have you used? What has worked for you?
Maybe I'm approaching this the wrong way? When all you have is a hammer... what am I missing here? Is there another approach?
(Score: 3, Informative) by frojack on Saturday June 04 2016, @08:00PM
No original content.
Scrape of entire articles not just a fair use paragraph.
No multi-source stories.
No, you are mistaken. I've always had this sig.
(Score: 2) by Runaway1956 on Sunday June 05 2016, @02:21AM
I've done it the "hard way", and I've done it with the IRC submission bot.
I'll admit, I do feel just a little guilty doing it the "easy way". ~submit url - that's so fast and easy, any braindead idiot can do it. But, sometimes, I'm getting ready for work, stumble over something I deem worthy of submission, and I simply don't have TIME.
Besides which - consensus seems to be that the submitter shouldn't add his own remarks, opinions, or observations, so the only advantage of manual submissions is that you "can" list additional sources. (I do that almost automatically when the source is Faux Noise, LOL)
“I have become friends with many school shooters” - Tampon Tim Walz
(Score: 3, Insightful) by frojack on Sunday June 05 2016, @07:36AM
you "can" list additional sources.
Yeah, I noticed butthurt has made that his personal signature.
Post as many sources as he can, but pick the twistedest version for his write-up.
I actually have no problem with submitters including comments and their own analysis. As long as they make it clear that is what they are doing.
No, you are mistaken. I've always had this sig.