Not infrequently, we receive a story submission to SoylentNews that consists of a single URL. Or, a story that contains but a single sentence or paragraph along with the URL. During weekends and holidays, the story submission queue tends to run dry. We have an IRC channel (#rss-bot) that gathers RSS links from around the web. Hmm.
It would be really handy if there were an automated way to "scrape" the contents of that page. In some cases, simply redirecting the output of a text-based browser like Lynx would do the trick. Unfortunately, all too many sites subscribe to the idea that a web page needs to pull in JavaScript and libraries from a myriad of other sites; without those, all you get is a blank page.
There must be a way to do it — search engines like Google and Bing must extract the page text in order to index it. A general-purpose solution would be best; a custom template for each site is time-consuming to create and maintain (consider what happens when a site changes its layout). Our site is powered by Perl, so that would be the obvious preference.
So, fellow Soylentils, what tools and/or techniques have you used? What has worked for you?
Maybe I'm approaching this the wrong way? When all you have is a hammer... what am I missing here? Is there another approach?
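For a sense of what a general-purpose first pass might look like, here is a minimal sketch using only Python's standard library (the same idea maps onto Perl's HTML::Parser). It strips tags and skips script/style contents. Note the built-in limitation the story describes: this does not execute JavaScript, so pages that build their content client-side will still come back effectively empty.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # >0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

sample = ("<html><head><script>var x = 1;</script></head>"
          "<body><h1>Title</h1><p>Body text.</p></body></html>")
print(extract_text(sample))  # → Title Body text.
```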
(Score: 3, Insightful) by Username on Saturday June 04 2016, @05:47PM
I haven’t used Perl since 2000 or so, but there should be some kind of inet or wget function or library you can use to get the page. You can fool most JavaScript-heavy websites into serving plain HTML by using a mobile browser user agent. If that doesn’t work, just delete the submission or drop the website from the scrape list. Fuck em.
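The user-agent trick might be sketched like this (Python stdlib for illustration; in Perl, LWP::UserAgent's agent() method does the same job). The UA string is just an example mobile browser string and the URL is hypothetical:

```python
import urllib.request

# Example mobile user-agent string; any current phone browser UA would do.
MOBILE_UA = ("Mozilla/5.0 (Linux; Android 6.0; Nexus 5) "
             "AppleWebKit/537.36 (KHTML, like Gecko) "
             "Chrome/50.0.2661.102 Mobile Safari/537.36")

def build_request(url):
    """Build a GET request that claims to be a mobile browser."""
    return urllib.request.Request(url, headers={"User-Agent": MOBILE_UA})

req = build_request("https://example.com/article")  # hypothetical URL
print(req.get_header("User-agent"))
# Actually fetching would then be:
#   html = urllib.request.urlopen(req).read().decode("utf-8", "replace")
```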
I would do it the template way. Most news sites do not change their templates very often, and when they do it would be a trivial matter of updating the scraping script to look for id="newstitle" instead of class="titlenews". You could even have a web page with a bunch of input boxes whose contents are read from and written to a config file that stores the lookup terms for the script, so moderators can update it without having to update the script itself. If a term isn’t found, the script can fall back to other methods to find the content, but you’ll likely get a bunch of garbage. I’d make this auto-generated article a submission itself that has to be approved rather than automatically shown on the page, just in case it does fuck up.
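The per-site lookup-term idea might look something like this (Python stdlib for illustration; the site names, ids, and classes are all made up — in practice the table would live in the editable config file the comment describes):

```python
from html.parser import HTMLParser

# Hypothetical per-site lookup table, e.g. loaded from a config file
# that moderators can edit through a web form.
SITE_CONFIG = {
    "example.com":       {"attr": "id",    "value": "newstitle"},
    "othernews.example": {"attr": "class", "value": "titlenews"},
}

class AttrGrabber(HTMLParser):
    """Capture the text inside the first element matching (attr, value)."""

    def __init__(self, attr, value):
        super().__init__()
        self.attr, self.value = attr, value
        self.tag = None        # tag we are currently capturing inside
        self.capturing = False
        self.done = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if not self.done and not self.capturing and (self.attr, self.value) in attrs:
            self.capturing = True
            self.tag = tag

    def handle_endtag(self, tag):
        if self.capturing and tag == self.tag:
            self.capturing = False
            self.done = True

    def handle_data(self, data):
        if self.capturing:
            self.text.append(data)

def extract(site, html):
    cfg = SITE_CONFIG[site]
    parser = AttrGrabber(cfg["attr"], cfg["value"])
    parser.feed(html)
    return "".join(parser.text).strip()

page = '<div id="newstitle">Big News Today</div><div id="body">...</div>'
print(extract("example.com", page))  # → Big News Today
```

If the site changes its layout, only the SITE_CONFIG entry needs updating, not the code.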
(Score: 2) by Username on Saturday June 04 2016, @06:12PM
PS: I’d also compare articles on the scrape list to other articles scraped at the same time and link them together. I’d probably take the first 2000 characters of the article, remove common words like "of|the|a|it" etc., and collapse all runs of whitespace into a single space (char 32). Then load the remaining words into an array using that space as the delimiter, look for each word in the other articles’ arrays, and use the number of common words to determine relevancy. Or maybe just find some keyword from a keyword list and tag the article with it.
Something like that.
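The steps above can be sketched as follows (Python for illustration; the stopword list is a made-up stub, and punctuation handling is left out):

```python
import re

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"of", "the", "a", "it", "and", "to", "in", "is"}

def keywords(article, limit=2000):
    """First `limit` chars, whitespace collapsed to single spaces,
    lowercased, stopwords removed; returned as a set of words."""
    text = re.sub(r"\s+", " ", article[:limit].lower())
    return {w for w in text.split(" ") if w and w not in STOPWORDS}

def relevancy(a, b):
    """Number of distinct non-stopword words two articles share."""
    return len(keywords(a) & keywords(b))

a = "The kernel patch fixes a scheduler bug in the Linux kernel"
b = "Linux scheduler bug patched in latest kernel release"
print(relevancy(a, b))  # → 4  (kernel, scheduler, bug, linux)
```

A pair of articles with a high enough overlap count could then be linked, or the shared words used as tags.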