
SoylentNews is people

Meta
posted by takyon on Saturday June 04 2016, @03:30PM
from the just-a-thought dept.

Not infrequently, we receive a story submission to SoylentNews that consists of a single URL. Or, a story that contains but a single sentence or paragraph along with the URL. During weekends and holidays, the story submission queue tends to run dry. We have an IRC channel (#rss-bot) that gathers RSS links from around the web. Hmm.

It would be really handy if there were an automated way to "scrape" the contents of that page. In some cases, a simple redirect of the output of a text-based browser like Lynx would do the trick. Unfortunately, all too many sites subscribe to the idea that a web page needs to pull in JavaScript and libraries from a myriad of other sites. A browser that does not run those scripts is shown a blank page.

There must be a way to do it — search engines like Google and Bing must extract the page text in order to index it. It would be best to have a general-purpose solution; having a custom template for each site is time-consuming to create and maintain (think if the site changes its layout). Our site is powered by Perl, so that would be the obvious preference.
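
As a starting point for the general-purpose approach, here is a minimal sketch in Python (not Perl, and not anything the site currently runs) that pulls the visible text out of an HTML string using only the standard library. The names `TextExtractor` and `extract_text` are hypothetical, and it assumes the page has already been fetched and that its text is present in the HTML itself. It does nothing for pages that build their content with JavaScript.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script/style blocks.

    Hypothetical helper for illustration only.
    """

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []       # one entry per run of visible text

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace and drop empty fragments.
            text = " ".join(data.split())
            if text:
                self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML document, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    sample = (
        "<html><head><title>Demo</title>"
        "<script>var x = 1;</script></head>"
        "<body><h1>Hello</h1><p>World &amp; more</p></body></html>"
    )
    print(extract_text(sample))
```

This is roughly what redirecting the output of a text-based browser gives you, minus any layout. It makes no attempt to separate the article from navigation, ads, or comments, which is the hard part of a site-independent solution.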

So, fellow Soylentils, what tools and/or techniques have you used? What has worked for you?

Maybe I'm approaching this the wrong way? When all you have is a hammer... what am I missing here? Is there another approach?


This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 2) by martyb on Sunday June 05 2016, @01:15PM

    by martyb (76) on Sunday June 05 2016, @01:15PM (#355502) Journal

    Better to get people to submit stories than publicly beg for an automated plagiarism machine.

    I see from your history that you regularly submit stories (about 60 over the past year!), and even more frequently, submit comments — please accept my genuine thanks for your active participation!

    As for "automated plagiarism machine", I sense a misunderstanding. This was an effort to obtain, within the initial submission, a form of a linked story that could be used as a basis for creating a story for SoylentNews. In no place was it stated that we would publish that as-is. Editors, well, edit. But, it is much easier to edit when the whole story is presented, with a consistent, filtered input. Select-all, copy, paste loses both links (e.g. "Click here to see our earlier coverage" and "here" does not come through as a link) and formatting (e.g. italics that identify titles of publications, like The New York Times.) Maybe there is a better, easier way?
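
    To make the "consistent, filtered input" idea concrete, here is a rough sketch in Python (standard library only) of a filter that keeps links and emphasis but drops every other tag. The names `KeepLinksAndEmphasis` and `filter_markup` are made up for illustration, and the set of tags to keep is an assumption, not a statement of what the submission form accepts.

```python
from html import escape
from html.parser import HTMLParser


class KeepLinksAndEmphasis(HTMLParser):
    """Re-emit text plus a small whitelist of tags; drop everything else.

    Hypothetical helper; the KEEP set is an assumption for illustration.
    """

    KEEP = {"a", "i", "em", "b", "strong"}
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1
        elif self._skip:
            return
        elif tag == "a":
            # Keep only the href; other attributes (class, onclick, ...) go.
            href = dict(attrs).get("href") or ""
            self.out.append('<a href="%s">' % escape(href, quote=True))
        elif tag in self.KEEP:
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip = max(0, self._skip - 1)
        elif not self._skip and tag in self.KEEP:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self._skip:
            self.out.append(escape(data, quote=False))


def filter_markup(html):
    """Return html with only links and emphasis tags preserved."""
    parser = KeepLinksAndEmphasis()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(filter_markup(
        '<p>See <a href="http://example.com/x" class="c">here</a> '
        'in <i>The New York Times</i>.</p>'
    ))
```

    Unlike select-all, copy, paste, this keeps "here" as a link and the publication title in italics. It does not resolve relative URLs or repair malformed markup.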

    What is difficult about submitting stories? Say, you see an interesting story on the web and think it might be interesting to the community. Why not submit it? Too much effort involved? How can we make it easier? That is the nature of the problem that this story was attempting to explore... an automated mechanism to make things easier.

    If people don't submit on weekends, it's probably because people don't read the site on weekends, and wouldn't notice it not being updated as often.

    Non Sequitur? In my experience, one does not necessarily imply the other. Plenty of people read the site without submitting on both weekdays AND weekends... I'd hazard a guess that the majority of the community falls into that category.

    It is true, however, that the number of submissions for the weekend does take a nosedive. Could just be a matter of, say, half as many people on the site results in half as many submissions (pulling numbers out of my posterior).

    On the other hand, there is no shortage of people reading SoylentNews on weekends. Ignoring ACs, this story alone has been viewed in excess of 1000 times. As a rule of thumb, we have found that hits on our load balancer suggest that actual story readership is on the order of ten times that amount.

    The more general problem, as I see it, is what hinders people from making submissions in the first place? What is difficult about the process? What tools are available to make it easier? This story pursued just one possible approach. I repeat the question posed in the original story:

    Maybe I'm approaching this the wrong way? When all you have is a hammer... what am I missing here? Is there another approach?

    I've seen many helpful responses already and will be looking into them. Please accept my humble thanks for all of the suggestions! I sense that I have identified a possible mechanism for a solution to a part of the problem, and should be looking at it from a higher level of abstraction. Maybe what would be more useful is a browser plugin that one could activate and which would guide the submission process from within a web-page-of-interest? Please keep the ideas coming!

    --
    Wit is intellect, dancing. I'm too old to act my age. Life is too important to take myself seriously.
  • (Score: 2) by frojack on Sunday June 05 2016, @05:08PM

    by frojack (1554) on Sunday June 05 2016, @05:08PM (#355540) Journal

    I see a fairly large percentage of stories posted by the known bots that consist solely of copy-pasted paragraphs. Not saying that's totally bad, as long as the meat of the article is actually captured in the copy/paste. Too often these bot-submitted articles get posted one paragraph or one sentence too short - forcing TFA chasing by every reader.

    (I argue with myself as to whether the idea is to get people to follow links to TFA, or merely have TFA for backup and attribution purposes. Personally I like to excerpt the story, dig out the nuggets so that readers don't really have to read TFA, because I suspect most won't. Rarely will I set up a story to require TFA reading.)

    I feel no compunction to list a ton of different source links, because often these are redundant - if not word for word copies off of the wire services.

    And you would be surprised how many times I start a submission, but upon digging through the TFA I find it a huge load of tripe, and just abandon it. If my bullshit bell is ringing, or the source has less than the best reputation I usually back away from the keyboard so nobody gets hurt. Do I dare submit from RT or Sputnik? Sure some of their articles sound plausible, but the ice is thin and the current strong. Sometimes entire stories are just flamebait.

    --
    No, you are mistaken. I've always had this sig.
    • (Score: 2) by martyb on Monday June 06 2016, @03:40AM

      by martyb (76) on Monday June 06 2016, @03:40AM (#355711) Journal

      Thanks for the reasoned and civil response to my reply. I'll continue in that vein and reply only to that which you replied.

      I see a fairly large percentage of stories posted by the known bots that consist solely of copy-pasted paragraphs. Not saying that's totally bad, as long as the meat of the article is actually captured in the copy/paste. Too often these bot-submitted articles get posted one paragraph or one sentence too short - forcing TFA chasing by every reader.

      For the record, to my knowledge, there are only three: "Arthur T Knackerbracket", "exec", and "MrPlow". Based on the scraping tech taken from "Arthur", a new one is in the works which is called "x" at the moment.

      That is one of the problems I regularly face as an editor. If the submission *seems* to hold together on its own, I succumb to efficiency instead of exhaustive examination, and generally accept it without too much rework, but it certainly depends on the submitter. For example, there is one regular submitter whose stories generally drop italics. Oh, a story from 'foo'? Need to check on x, y, and z.

      I'm gaining a sense of which submitters are prone to submitting a too-short story. Then the challenge comes in finding out what they DID submit, and what was omitted, and putting it back together. Strictly from an editing standpoint, I'd take the too-much over the too-little submitted, with the caveat that, if it is substantially the whole article, the submitter make that known.

      That was, in part, the motivation for the original story; I find it easier to cut than to extend.

      (I argue with myself as to whether the idea is to get people to follow links to TFA, or merely have TFA for backup and attribution purposes. Personally I like to excerpt the story, dig out the nuggets so that readers don't really have to read TFA, because I suspect most won't. Rarely will I set up a story to require TFA reading.)

      I argue with myself, too. I hope you have more success than I do! ;-) In some cases, it is exceedingly difficult to condense the story down any further than the original story. In those cases, I just muddle through as best I can and leave it to the curious to read the rest of the story. In other cases, I've most likely missed a salient paragraph (or two). It happens. I'd estimate about half of my editing is done after 10pm, and that is usually after a full day at work.

      I generally try to take a light touch in my editing. In short, if it passes a "sniff test", i.e. the story holds together for the most part, is not too terribly biased, and covers something that I sense would be of interest to the community, it'll probably make it out pretty much as received. The others, well, time and energy permitting, I'll give a go at those, too. I just make sure to cinch up my belt, take a deep breath, and dive in.

      It is generally not my intention to try and force reading of TFA, though I have no doubts it has happened on more than a few occasions.

      I feel no compunction to list a ton of different source links, because often these are redundant - if not word for word copies off of the wire services.

      I have a similar view on this. Include the links to whatever was provided in the submission. If there is another "angle" that was published elsewhere, feel free to mention it, but it would be helpful to include why it was deemed interesting beyond what was already submitted.

      And you would be surprised how many times I start a submission, but upon digging through the TFA I find it a huge load of tripe, and just abandon it. If my bullshit bell is ringing, or the source has less than the best reputation I usually back away from the keyboard so nobody gets hurt. Do I dare submit from RT or Sputnik? Sure some of their articles sound plausible, but the ice is thin and the current strong. Sometimes entire stories are just flamebait.

      Why do you think I'd be surprised? That has happened to me countless times on my submissions, too. Further, I've encountered the same thing when editing a story: I spend a bunch of time formatting it and reviewing the sources, and then find something that just doesn't jibe.

      In some cases, there are enough stories in the submission queue that I can just ignore this one and try a different story. In other cases, the empty story is yelling at me. Then, as much work as it will be, I find I need to dive in and just try to make a silk purse out of half a sow's ear.

      Going full circle, two of the bots submit little more than a URL, a title, and maybe a couple/few sentences at best. I'd sooner see the whole story be submitted, along with a caution that it includes the entire story text. That would be a flag to me that I cannot run the entire submission as a story and that I need to pare it down before pushing it out to the story queue. And that was a big part of the motivation for the original story's request -- trying to avoid a too-succinct story submission.

      Okay, it is well nigh midnight and I am struggling to continue. I hope this response better explains my motivation and what I envisioned as a possible path that would improve things.

      Good talking with you; I look forward to your response. (I'm terribly busy the next few days, so I apologize to you in advance if I do not respond to your reply immediately or even in short order.)