Not infrequently, we receive a story submission to SoylentNews that consists of a single URL, or one that contains but a single sentence or paragraph along with the URL. During weekends and holidays, the story submission queue tends to run dry. We have an IRC channel (#rss-bot) that gathers RSS links from around the web. Hmm.
It would be really handy if there were an automated way to "scrape" the contents of that page. In some cases, a simple redirect of the output of a text-based browser like Lynx would do the trick. Unfortunately, all too many sites subscribe to the idea that a web page needs to pull in JavaScript and libraries from a myriad of other sites. A browser that fails to do so displays a blank page.
There must be a way to do it; search engines like Google and Bing must extract the page text in order to index it. It would be best to have a general-purpose solution; a custom template for each site is time-consuming to create and maintain (think of what happens when a site changes its layout). Our site is powered by Perl, so that would be the obvious preference.
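To make the general-purpose idea concrete, here is a minimal sketch, written in Python rather than Perl and using only the standard library's html.parser. It is not what the site or any of the bots actually run. The names TextExtractor and extract_text are made up for illustration. It keeps the visible text of a page and drops the contents of script and style blocks. It only sees the HTML as delivered, so a page that builds its content in JavaScript would still come out empty.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/noscript content.

    Hypothetical helper for illustration only.
    """

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty text fragments, in page order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    sample = (
        "<html><head><title>T</title><script>var x = 1;</script></head>"
        "<body><h1>Headline</h1><p>First &amp; second.</p></body></html>"
    )
    print(extract_text(sample))
```

This makes no attempt to tell the article body apart from navigation, ads, or comments; that is the hard part, and it would still need either per-site rules or some heuristic on top.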
So, fellow Soylentils, what tools and/or techniques have you used? What has worked for you?
Maybe I'm approaching this the wrong way? When all you have is a hammer... what am I missing here? Is there another approach?
(Score: 2) by frojack on Sunday June 05 2016, @05:08PM
I see a fairly large percentage of stories posted by the known bots that consist solely of copy-pasted paragraphs. Not saying that's totally bad, as long as the meat of the article is actually captured in the copy/paste. Too often these bot-submitted articles get posted one paragraph or one sentence too short - forcing TFA chasing by every reader.
(I argue with myself as to whether the idea is to get people to follow links to TFA, or merely to have TFA for backup and attribution purposes. Personally I like to excerpt the story, dig out the nuggets so that readers don't really have to read TFA, because I suspect most won't. Rarely will I set up a story to require TFA reading.)
I feel no compulsion to list a ton of different source links, because often these are redundant - if not word-for-word copies off of the wire services.
And you would be surprised how many times I start a submission, but upon digging through the TFA I find it a huge load of tripe, and just abandon it. If my bullshit bell is ringing, or the source has less than the best reputation, I usually back away from the keyboard so nobody gets hurt. Do I dare submit from RT or Sputnik? Sure, some of their articles sound plausible, but the ice is thin and the current strong. Sometimes entire stories are just flamebait.
No, you are mistaken. I've always had this sig.
(Score: 2) by martyb on Monday June 06 2016, @03:40AM
Thanks for the reasoned and civil response to my reply. I'll continue in that vein and reply only to that which you replied.
For the record, to my knowledge, there are only three: "Arthur T Knackerbracket", "exec", and "MrPlow". Based on the scraping tech taken from "Arthur", a new one is in the works which is called "x" at the moment.
That is one of the problems I regularly face as an editor. If the submission *seems* to hold together on its own, I succumb to efficiency instead of exhaustive examination, and generally accept it without too much rework, but it certainly depends on the submitter. For example, there is one regular submitter whose stories generally drop italics. Oh, a story from 'foo'? Need to check on x, y, and z.
I'm gaining a sense of which submitters are prone to submitting a too-short story. Then the challenge comes in finding out what they DID submit, and what was omitted, and putting it back together. Strictly from an editing standpoint, I'd take the too-much over the too-little submitted, with the caveat that if it is substantially the whole article, that the submitter make that known.
That was, in part, the motivation for the original story; I find it easier to cut than to extend.
I argue with myself, too. I hope you have more success than I do! ;-) In some cases, it is exceedingly difficult to condense the story down any further than the original story. In those cases, I just muddle through as best I can and leave it to the curious to read the rest of the story. In other cases, I've most likely missed a salient paragraph (or two). It happens. I'd estimate about half of my editing is done after 10pm, and that is usually after a full day at work.
I generally try to take a light touch in my editing. In short, if it passes a "sniff test", i.e. the story holds together for the most part, is not too terribly biased, and covers something that I sense would be of interest to the community, it'll probably make it out pretty much as received. The others, well, time and energy permitting, I'll give a go at those, too. I just make sure to cinch up my belt, take a deep breath, and dive in.
It is generally not my intention to try and force reading of TFA, though I have no doubts it has happened on more than a few occasions.
I have a similar view on this. Include the links to whatever was provided in the submission. If there is another "angle" that was published elsewhere, feel free to mention it, but it would be helpful to include why it was deemed interesting beyond what was already submitted.
Why do you think I'd be surprised? That has happened to me countless times on my submissions, too. Further, I've encountered the same thing when editing a story. Spend a bunch of time formatting it, reviewing the sources, and then find something that just doesn't jibe.
In some cases, there are enough stories in the submission queue that I can just ignore this one and try a different story. In other cases, the empty story is yelling at me. Then, as much work as it will be, I find I need to dive in and just try to make a silk purse out of half a sow's ear.
Going full circle, two of the bots submit little more than a URL, a title, and maybe a couple/few sentences at best. I'd sooner see the whole story be submitted, along with a caution that it includes the entire story text. That would be a flag to me that I cannot run the entire submission as a story and that I need to pare it down before pushing it out to the story queue. And that was a big part of the motivation for the original story's request -- trying to avoid a too-succinct story submission.
Okay, it is well nigh midnight and I am struggling to continue. I hope this response better explains my motivation and what I envisioned as a possible path that would improve things.
Good talking with you; I look forward to your response. (I'm terribly busy the next few days, so I apologize to you in advance if I do not respond to your reply immediately or even in short order.)