Not infrequently, we receive a story submission to SoylentNews that consists of a single URL, or one that contains but a single sentence or paragraph along with the URL. During weekends and holidays, the story submission queue tends to run dry. We have an IRC channel (#rss-bot) that gathers RSS links from around the web. Hmm.
It would be really handy if there were an automated way to "scrape" the contents of that page. In some cases, a simple redirect of the output of a text-based browser like Lynx would do the trick. Unfortunately, all too many sites subscribe to the idea that a web page needs to pull in JavaScript and libraries from a myriad of other sites. Failing to do so displays a blank page.
There must be a way to do it: search engines like Google and Bing must extract the page text in order to index it. It would be best to have a general-purpose solution; a custom template for each site is time-consuming to create and maintain (think of what happens when a site changes its layout). Our site is powered by Perl, so that would be the obvious preference.
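For pages that do ship their text in the HTML itself, the "Lynx dump" idea can be sketched roughly as follows. This is only a minimal illustration, written in Python with nothing but the standard library (Perl would still be the preference for our site). The names `TextExtractor` and `html_to_text` are made up for this example, and it does nothing for pages that build their content with JavaScript.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, ignoring <script> and <style> bodies.

    Hypothetical helper for illustration; not part of any existing tool.
    """

    SKIP = {"script", "style"}  # contents of these tags are never page text
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # block-level tags start a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

The HTML can come from any fetcher. The function only sees what the server sent, so a page that renders blank without scripts will come out blank here too.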
So, fellow Soylentils, what tools and/or techniques have you used? What has worked for you?
Maybe I'm approaching this the wrong way? When all you have is a hammer... what am I missing here? Is there another approach?
(Score: 0) by Anonymous Coward on Saturday June 04 2016, @03:50PM
Unfortunately, all too many sites subscribe to the idea that a web page needs to pull in JavaScript and libraries from a myriad of other sites. Failing to do so displays a blank page.
I use NoScript set to block JavaScript by default, and when I enable JavaScript I always use the "temporary for this page" setting. So if a site needs JavaScript, I have to re-enable it manually each time I go there, which makes me super cognizant of when a site needs it.
My experience is that most news sites work better without JavaScript. It gets past a lot of paywalls too. The formatting can be ugly, but Firefox's "reader mode" fixes the ugliness nearly every time (and reader mode does not phone home or do anything else sneaky).
(Score: -1, Offtopic) by Anonymous Coward on Saturday June 04 2016, @05:13PM
I use noscript
+1000 Trendy Pseudo-Geek
Congratulations for being a follower!
(Score: -1, Troll) by Anonymous Coward on Saturday June 04 2016, @05:25PM
Shattap ya goddamn nigga-jew-boy.
(Score: -1, Redundant) by Anonymous Coward on Saturday June 04 2016, @05:41PM
Yeah that's right, and my dick's so big that when my black mammy had me circumcised she saved the skin and now I carry my foreskin around with me on a chain. I'ma beat yer ass with my massive foreskin, white goy!
(Score: 2) by Runaway1956 on Sunday June 05 2016, @02:27AM
Trendy has nothing to do with anything. I can't speak for anyone else, but I have limited bandwidth, shared among two people all the time, and as many as 8 people sometimes. Advertising can and will hog 75 to 95% of my bandwidth. How in hell can I browse the web if advertising is consuming the lion's share of my 1 1/2 to 2 MB of bandwidth? I block everything (web fonts, scripts, cross-site scripting, known ad servers), and then it is possible to share that limited bandwidth among however many of us are online.
That doesn't even begin to describe how much I detest the practices of commercial interests on the internet. If I had unlimited bandwidth, I would still block all the shite that I block now.
Trendy? Hardly. The masses are consuming the commercial shite just as fast as the corporations can shovel it to them. Witness Facebook and the like.