Meta
posted by takyon on Saturday June 04 2016, @03:30PM
from the just-a-thought dept.

Not infrequently, we receive a story submission to SoylentNews that consists of a single URL. Or, a story that contains but a single sentence or paragraph along with the URL. During weekends and holidays, the story submission queue tends to run dry. We have an IRC channel (#rss-bot) that gathers RSS links from around the web. Hmm.

It would be really handy if there were an automated way to "scrape" the contents of that page. In some cases, a simple redirect of the output of a text-based browser like Lynx would do the trick. Unfortunately, all too many sites subscribe to the idea that a web page needs to pull in Javascript and libraries from a myriad of other sites. Failing to do so displays a blank page.

There must be a way to do it — search engines like Google and Bing must extract the page text in order to index it. It would be best to have a general-purpose solution; having a custom template for each site is time-consuming to create and maintain (think if the site changes its layout). Our site is powered by Perl, so that would be the obvious preference.

So, fellow Soylentils, what tools and/or techniques have you used? What has worked for you?

Maybe I'm approaching this the wrong way? When all you have is a hammer... what am I missing here? Is there another approach?


Original Submission

  • (Score: 2) by Runaway1956 on Saturday June 04 2016, @03:44PM

    by Runaway1956 (2926) Subscriber Badge on Saturday June 04 2016, @03:44PM (#355116) Journal

    How does Calibre fetch news? http://manual.calibre-ebook.com/news.html [calibre-ebook.com] I have no idea if Calibre is in a speaking relationship with Perl though.

    • (Score: 4, Insightful) by frojack on Saturday June 04 2016, @04:38PM

      by frojack (1554) on Saturday June 04 2016, @04:38PM (#355143) Journal

      I ran Calibre's fetch-news app for quite a while, but transferring the news to my e-reader was never worth the effort.
      It is rather indiscriminate and is designed to scrape the entire site rather than just an article, but the scripts and source are all available.

      But here we are, with the IRC-channel tail wagging the website dog, something I warned about when the IRC channel was first discussed.
      If people think so little of a story that a one-liner on IRC is all they can muster, maybe we don't need that story.

      --
      No, you are mistaken. I've always had this sig.
  • (Score: 5, Insightful) by frojack on Saturday June 04 2016, @03:46PM

    by frojack (1554) on Saturday June 04 2016, @03:46PM (#355118) Journal

    Better to get people to submit stories than publicly beg for an automated plagiarism machine.

    If people don't submit on weekends, it's probably because people don't read the site on weekends, and wouldn't notice it not being updated as often.

    --
    No, you are mistaken. I've always had this sig.
    • (Score: 0) by Anonymous Coward on Saturday June 04 2016, @05:35PM

      by Anonymous Coward on Saturday June 04 2016, @05:35PM (#355168)

      Dats rite! Let dem trolls comment on teh submitted stories til som fool submit mor!

    • (Score: 1) by idetuxs on Saturday June 04 2016, @06:40PM

      by idetuxs (2990) on Saturday June 04 2016, @06:40PM (#355196)

      Better to get people to submit stories than publicly beg for an automated plagiarism machine.

      How is that different from doing it manually?

      • (Score: 3, Informative) by frojack on Saturday June 04 2016, @08:00PM

        by frojack (1554) on Saturday June 04 2016, @08:00PM (#355229) Journal

        No original content.
        Scraping of entire articles, not just a fair-use paragraph.
        No multi-source stories.

        --
        No, you are mistaken. I've always had this sig.
        • (Score: 2) by Runaway1956 on Sunday June 05 2016, @02:21AM

          by Runaway1956 (2926) Subscriber Badge on Sunday June 05 2016, @02:21AM (#355373) Journal

          I've done it the "hard way", and I've done it with the IRC submission bot.

          I'll admit, I do feel just a little guilty doing it the "easy way". ~submit url - that's so fast and easy, any braindead idiot can do it. But, sometimes, I'm getting ready for work, stumble over something I deem worthy of submission, and I simply don't have TIME.

          Besides which - consensus seems to be that the submitter shouldn't add his own remarks, opinions, or observations, so the only advantage of manual submissions is that you "can" list additional sources. (I do that almost automatically when the source is Faux Noise, LOL)

          • (Score: 3, Insightful) by frojack on Sunday June 05 2016, @07:36AM

            by frojack (1554) on Sunday June 05 2016, @07:36AM (#355431) Journal

            you "can" list additional sources.

            Yeah, I noticed butthurt has made that his personal signature.
            He posts as many sources as he can, but picks the most twisted version for his write-up.

            I actually have no problem with submitters including comments and their own analysis. As long as they make it clear that is what they are doing.

            --
            No, you are mistaken. I've always had this sig.
    • (Score: 2) by martyb on Sunday June 05 2016, @01:15PM

      by martyb (76) Subscriber Badge on Sunday June 05 2016, @01:15PM (#355502) Journal

      Better to get people to submit stories than publicly beg for an automated plagiarism machine.

      I see from your history that you regularly submit stories (about 60 over the past year!), and even more frequently, submit comments — please accept my genuine thanks for your active participation!

      As for "automated plagiarism machine", I sense a misunderstanding. This was an effort to obtain, within the initial submission, a form of a linked story that could be used as a basis for creating a story for SoylentNews. In no place was it stated that we would publish that as-is. Editors, well, edit. But, it is much easier to edit when the whole story is presented, with a consistent, filtered input. Select-all, copy, paste loses both links (e.g. "Click here to see our earlier coverage" and "here" does not come through as a link) and formatting (e.g. italics that identify titles of publications, like The New York Times.) Maybe there is a better, easier way?

      What is difficult about submitting stories? Say, you see an interesting story on the web and think it might be interesting to the community. Why not submit it? Too much effort involved? How can we make it easier? That is the nature of the problem that this story was attempting to explore... an automated mechanism to make things easier.

      If people don't submit on weekends, its probably because people don't read the site on weekends, and wouldn't notice it not being updated as often.

      Non Sequitur? In my experience, one does not necessarily imply the other. Plenty of people read the site without submitting on both weekdays AND weekends... I'd hazard a guess that the majority of the community falls into that category.

      It is true, however, that the number of submissions for the weekend does take a nosedive. Could just be a matter of, say, half as many people on the site results in half as many submissions (pulling numbers out of my posterior).

      On the other hand, there is no shortage of people reading SoylentNews on weekends. Ignoring ACs, this story alone has been viewed in excess of 1000 times. As a rule of thumb, we have found that hits on our load balancer suggest that actual story readership is on the order of ten times that amount.

      The more general problem, as I see it, is what hinders people from making submissions in the first place? What is difficult about the process? What tools are available to make it easier? This story pursued just one possible approach. I repeat the question posed in the original story:

      Maybe I'm approaching this the wrong way? When all you have is a hammer... what am I missing here? Is there another approach?

      I've seen many helpful responses already and will be looking into them. Please accept my humble thanks for all of the suggestions! I sense that I have identified a possible mechanism for a solution to a part of the problem, and should be looking at it from a higher level of abstraction. Maybe what would be more useful is a browser plugin that one could activate and which would guide the submission process from within a web-page-of-interest? Please keep the ideas coming!

      --
      Wit is intellect, dancing.
      • (Score: 2) by frojack on Sunday June 05 2016, @05:08PM

        by frojack (1554) on Sunday June 05 2016, @05:08PM (#355540) Journal

        I see a fairly large percentage of stories posted by the known bots that consist solely of copy-pasted paragraphs. Not saying that's totally bad, as long as the meat of the article is actually captured in the copy/paste. Too often these bot-submitted articles get posted one paragraph or one sentence too short - forcing TFA chasing by every reader.

        (I argue with myself as to whether the idea is to get people to follow links to TFA, or merely have TFA as backup and attribution purposes. Personally I like to excerpt the story, dig out the nuggets so that readers don't really have to read TFA, because I suspect most won't. Rarely will I set up a story to require TFA reading.)

        I feel no compunction to list a ton of different source links, because often these are redundant - if not word for word copies off of the wire services.

        And you would be surprised how many times I start a submission, but upon digging through the TFA I find it's a huge load of tripe, and just abandon it. If my bullshit bell is ringing, or the source has less than the best reputation, I usually back away from the keyboard so nobody gets hurt. Do I dare submit from RT or Sputnik? Sure, some of their articles sound plausible, but the ice is thin and the current strong. Sometimes entire stories are just flamebait.

        --
        No, you are mistaken. I've always had this sig.
        • (Score: 2) by martyb on Monday June 06 2016, @03:40AM

          by martyb (76) Subscriber Badge on Monday June 06 2016, @03:40AM (#355711) Journal

          Thanks for the reasoned and civil response to my reply. I'll continue in that vein and reply only to that which you replied.

          I see a fairly large percentage of stories posted by the known bots that consist solely of copy-pasted paragraphs. Not saying that's totally bad, as long as the meat of the article is actually captured in the copy/paste. Too often these bot-submitted articles get posted one paragraph or one sentence too short - forcing TFA chasing by every reader.

          For the record, to my knowledge, there are only three: "Arthur T Knackerbracket", "exec", and "MrPlow". Based on the scraping tech taken from "Arthur", a new one is in the works which is called "x" at the moment.

          That is one of the problems I regularly face as an editor. If the submission *seems* to hold together on its own, I succumb to efficiency instead of exhaustive examination, and generally accept it without too much rework, but it certainly depends on the submitter. For example, there is one regular submitter whose stories generally drop italics. Oh, a story from 'foo'? Need to check on x, y, and z.

          I'm gaining a sense of which submitters are prone to submitting a too-short story. Then the challenge comes in finding out what they DID submit, and what was omitted, and putting it back together. Strictly from an editing standpoint, I'd take the too-much over the too-little submitted, with the caveat that, if it is substantially the whole article, the submitter makes that known.

          That was, in part, the motivation for the original story; I find it easier to cut than to extend.

          (I argue with myself as to whether the idea is to get people to follow links to TFA, or merely have TFA as backup and attribution purposes. Personally I like to excerpt the story, dig out the nuggets so that readers don't really have to read TFA, because I suspect most won't. Rarely will I set up a story to require TFA reading.)

          I argue with myself, too. I hope you have more success than I do! ;-) In some cases, it is exceedingly difficult to condense the story down any further than the original story. In those cases, I just muddle through as best I can and leave it to the curious to read the rest of the story. In other cases, I've most likely missed a salient paragraph (or two). It happens. I'd estimate about half of my editing is done after 10pm, and that is usually after a full day at work.

          I generally try to take a light touch in my editing. In short, if it passes a "sniff test", i.e. the story holds together for the most part, is not too terribly biased, and covers something that I sense would be of interest to the community, it'll probably make it out pretty much as received. The others, well, time and energy permitting, I'll give a go at those, too. I just make sure to cinch up my belt, take a deep breath, and dive in.

          It is generally not my intention to try and force reading of TFA, though I have no doubts it has happened on more than a few occasions.

          I feel no compunction to list a ton of different source links, because often these are redundant - if not word for word copies off of the wire services.

          I have a similar view on this. Include the links to whatever was provided in the submission. If there is another "angle" that was published elsewhere, feel free to mention it, but it would be helpful to include why it was deemed interesting beyond what was already submitted.

          And you would be surprised how many times I start a submission, but upon digging through the TFA I find it's a huge load of tripe, and just abandon it. If my bullshit bell is ringing, or the source has less than the best reputation, I usually back away from the keyboard so nobody gets hurt. Do I dare submit from RT or Sputnik? Sure, some of their articles sound plausible, but the ice is thin and the current strong. Sometimes entire stories are just flamebait.

          Why do you think I'd be surprised? That has happened to me countless times on my submissions, too. Further, I've encountered the same thing when editing a story. Spend a bunch of time formatting it, reviewing the sources, and then find something that just doesn't jibe.

          In some cases, there are enough stories in the submission queue that I can just ignore this one and try a different story. In other cases, the empty story is yelling at me. Then, as much work as it will be, I find I need to dive in and just try to make a silk purse out of half a sow's ear.

          Going full circle, two of the bots submit little more than a URL, a title, and maybe a couple/few sentences at best. I'd sooner see the whole story be submitted, along with a caution that it includes the entire story text. That would be a flag to me that I cannot run the entire submission as a story and that I need to pare it down before pushing it out to the story queue. And that was a big part of the motivation for the original story's request -- trying to avoid a too-succinct story submission.

          Okay, it is well nigh midnight and I am struggling to continue. I hope this response better explains my motivation and what I envisioned as a possible path that would improve things.

          Good talking with you; I look forward to your response. (I'm terribly busy the next few days, so I apologize to you in advance if I do not respond to your reply immediately or even in short order.)

  • (Score: 0) by Anonymous Coward on Saturday June 04 2016, @03:50PM

    by Anonymous Coward on Saturday June 04 2016, @03:50PM (#355122)

    Unfortunately, all too many sites subscribe to the idea that a web page needs to pull in Javascript and libraries from a myriad of other sites. Failing to do so displays a blank page.

      I use NoScript set to block JavaScript by default, and when I enable JavaScript I always use the "temporary for this page" setting. So if a site needs JavaScript I have to re-enable it manually each time I go there, which keeps me very aware of which sites need it.

      My experience is that most news sites work better without JavaScript. It gets past a lot of paywalls too. The formatting can be ugly, but Firefox's "reader mode" fixes the ugliness nearly every time (and reader mode does not phone home or do anything else sneaky).

    • (Score: -1, Offtopic) by Anonymous Coward on Saturday June 04 2016, @05:13PM

      by Anonymous Coward on Saturday June 04 2016, @05:13PM (#355159)

      I use noscript

      +1000 Trendy Pseudo-Geek

      Congratulations for being a follower!

      • (Score: -1, Troll) by Anonymous Coward on Saturday June 04 2016, @05:25PM

        by Anonymous Coward on Saturday June 04 2016, @05:25PM (#355164)

        Shattap ya goddamn nigga-jew-boy.

        • (Score: -1, Redundant) by Anonymous Coward on Saturday June 04 2016, @05:41PM

          by Anonymous Coward on Saturday June 04 2016, @05:41PM (#355170)

          Yeah that's right, and my dick's so big that when my black mammy had me circumcised she saved the skin and now I carry my foreskin around with me on a chain. I'ma beat yer ass with my massive foreskin, white goy!

      • (Score: 2) by Runaway1956 on Sunday June 05 2016, @02:27AM

        by Runaway1956 (2926) Subscriber Badge on Sunday June 05 2016, @02:27AM (#355375) Journal

        Trendy has nothing to do with anything. I can't speak for anyone else, but I have limited bandwidth, shared between two people all the time, and as many as 8 people sometimes. Advertising can and will hog 75 to 95% of my bandwidth. How in hell can I browse the web if advertising is consuming the lion's share of my 1 1/2 to 2 MB of bandwidth? I block everything - web fonts, scripts, cross-site scripting, known ad servers - then it is possible to share that limited bandwidth among however many of us are online.

        That doesn't even begin to describe how much I detest the practices of commercial interests on the internet. If I had unlimited bandwidth, I would still block all the shite that I block now.

        Trendy? Hardly. The masses are consuming the commercial shite just as fast as the corporations can shovel it to them. Witness Facebook and the like.

  • (Score: 5, Insightful) by bradley13 on Saturday June 04 2016, @03:51PM

    by bradley13 (3053) on Saturday June 04 2016, @03:51PM (#355123) Homepage Journal

    ...many sites subscribe to the idea that a web page needs to pull in Javascript and libraries from a myriad of other sites. Failing to do so displays a blank page

    Sites like this deserve to die in obscurity. If the basic content isn't present without JavaScript, look elsewhere. Has the advantage of making your life simpler, as well...

    --
    Everyone is somebody else's weirdo.
    • (Score: 2, Insightful) by Anonymous Coward on Saturday June 04 2016, @04:25PM

      by Anonymous Coward on Saturday June 04 2016, @04:25PM (#355133)

      I agree with this. If a website is so badly written it can't even display text without client-side scripting we're better off without that website.

  • (Score: 2, Interesting) by isj on Saturday June 04 2016, @03:56PM

    by isj (5249) on Saturday June 04 2016, @03:56PM (#355125) Homepage

    As far as I know, Google actually runs the page in a small VM (probably V8 with a Firefox/Chrome-like environment), and when the page content has settled they index that. They earlier proposed a scheme to crawl AJAX pages: https://developers.google.com/webmasters/ajax-crawling/docs/learn-more#an-agreement-between-crawler-and-server [google.com] . However, it gets a bit more weird: https://www.searchenginejournal.com/google-backs-down-from-proposal-to-make-ajax-pages-crawlable/143291/ [searchenginejournal.com] where they instead recommend "progressive enhancement".

    My hypothesis is that Google made their VM emulation good enough, and that by deprecating the crawlable scheme, clueless webmasters will force all other crawlers to waste CPU and power executing JavaScript just to get to the information. A waste of CPU cycles.

    • (Score: 1, Informative) by Anonymous Coward on Saturday June 04 2016, @04:18PM

      by Anonymous Coward on Saturday June 04 2016, @04:18PM (#355131)

      Same as YouTube deprecating the anonymous API so they can track people: it forces you to scrape the pages and download hundreds of kB, also wasting CPU parsing them, just to get the list of the latest videos for each channel.

  • (Score: 2) by opinionated_science on Saturday June 04 2016, @04:12PM

    by opinionated_science (4031) on Saturday June 04 2016, @04:12PM (#355127)

    There's a reader button in firefox - cleans up a lot for reading purposes.

    Maybe this helps?

  • (Score: 3, Interesting) by canopic jug on Saturday June 04 2016, @04:18PM

    by canopic jug (3949) Subscriber Badge on Saturday June 04 2016, @04:18PM (#355130) Journal

    You probably know about LWP to fetch the pages. To extract text using Perl, I'd recommend HTML::TreeBuilder or HTML::TreeBuilder::XPath. I've recently used both with wget to extract thousands of pages from a defunct CMS and restructure the output into static HTML designed for human maintenance, converting some of the more obnoxious attempts at formatting to CSS and many chunks to SSI. It will work easily for screen scraping and will be easy to update when the target changes its layout. You might want to pipe the fetched pages through Tidy first to make sure they are valid HTML or at least well-formed XHTML.

    There is a corresponding XML module, XML::TreeBuilder, that functions in much the same way if you need to process XML feeds.
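
    For illustration, a minimal sketch along those lines; the XPath expression, div id, and user-agent string are placeholders to adapt per page:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::TreeBuilder::XPath;

    my $url = shift @ARGV or die "usage: $0 URL\n";

    my $ua  = LWP::UserAgent->new( timeout => 30, agent => 'SN-scraper/0.1' );
    my $res = $ua->get($url);
    die "fetch failed: ", $res->status_line, "\n" unless $res->is_success;

    # Parse the fetched (ideally tidied) HTML into a tree we can query with XPath.
    my $tree = HTML::TreeBuilder::XPath->new_from_content( $res->decoded_content );

    # Placeholder XPath: the text of every <p> inside the main article div.
    for my $p ( $tree->findnodes('//div[@id="article"]//p') ) {
        print $p->as_trimmed_text, "\n\n";
    }

    $tree->delete;    # free the parse tree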

    --
    Money is not free speech. Elections should not be auctions.
    • (Score: 2) by fliptop on Saturday June 04 2016, @07:30PM

      by fliptop (1666) on Saturday June 04 2016, @07:30PM (#355221) Journal

      To extract text using perl, I'd recommend HTML::TreeBuilder or HTML::TreeBuilder::XPath

      HTML::SimpleParse works pretty well too. Sites tend to have content in its own uniquely named div, so if you walk the $p->tree and find the div(s) you can easily retrieve what you're looking for.

      --
      Our Constitution was made only for a moral and religious people. It is wholly inadequate to the government of any other.
      • (Score: 2) by canopic jug on Saturday June 04 2016, @08:28PM

        by canopic jug (3949) Subscriber Badge on Saturday June 04 2016, @08:28PM (#355256) Journal

        HTML::SimpleParse works pretty well too. Sites tend to have content in its own uniquely named div, so if you walk the $p->tree and find the div(s) you can easily retrieve what you're looking for.

        Yes, I looked at HTML::SimpleParse and HTML::TokeParser first. For the task I had, I quickly needed something more. With HTML::TreeBuilder you can make fairly complex selections and even replacements.

        But reading all of the summary again, I see that the problem is javascript, not so much the HTML parsing. I generally don't waste my time on sites that are so heavy with javascript that they won't work with it turned off. Those kinds of sites probably should be encouraged to die. If SN wants to keep those kinds of sites alive then maybe JavaScript::SpiderMonkey and / or WWW::Mechanize::Firefox could work to parse the garbage pages before processing. However, those scripts and any dependent applications really need to be severely chrooted and locked down.
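
        If anyone wants to experiment, a minimal sketch of the WWW::Mechanize::Firefox route, assuming a running Firefox with the MozRepl extension (and locked down as noted above):

        use strict;
        use warnings;
        use WWW::Mechanize::Firefox;    # drives a real Firefox via the MozRepl extension

        my $url  = shift @ARGV or die "usage: $0 URL\n";
        my $mech = WWW::Mechanize::Firefox->new();

        $mech->get($url);        # Firefox executes the page's JavaScript for us
        print $mech->content;    # the rendered DOM, ready for HTML::TreeBuilder and friends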

        --
        Money is not free speech. Elections should not be auctions.
  • (Score: 4, Interesting) by Beige on Saturday June 04 2016, @04:29PM

    by Beige (3989) on Saturday June 04 2016, @04:29PM (#355137) Homepage

    Render the page through a command-line renderer like wkhtmltopdf ( http://wkhtmltopdf.org/ [wkhtmltopdf.org] ) and then convert the PDF to text. It should then be pretty easy to write a Perl (etc.) function which keeps only paragraphs of 60+ words (i.e. skips any menus, links, etc.).
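
    Something like this, perhaps; a rough sketch assuming wkhtmltopdf and poppler's pdftotext are installed, with an arbitrary 60-word cutoff and temp paths:

    use strict;
    use warnings;

    my $url = shift @ARGV or die "usage: $0 URL\n";

    # Render the page (JavaScript included) to PDF, then flatten it to plain text.
    system('wkhtmltopdf', '--quiet', $url, '/tmp/page.pdf') == 0 or die "wkhtmltopdf failed\n";
    system('pdftotext', '/tmp/page.pdf', '/tmp/page.txt')   == 0 or die "pdftotext failed\n";

    # Keep only paragraphs of 60+ words; menus, captions and link lists fall away.
    open my $fh, '<', '/tmp/page.txt' or die $!;
    local $/ = '';                      # paragraph mode: read blank-line-separated chunks
    while ( my $para = <$fh> ) {
        my @words = split /\s+/, $para;
        print $para, "\n" if @words >= 60;
    }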

    • (Score: 2) by fishybell on Saturday June 04 2016, @05:24PM

      by fishybell (3156) on Saturday June 04 2016, @05:24PM (#355163)

      +1 for this. I used it for a project at work where it was easier to render something with html/css than with Tcl/Tk [www.tcl.tk] (what the app requiring the rendering was written in). I ran into the problem of platform compatibility with our version of Linux, but was using an older version of the software from its days on Google Code.

  • (Score: 3, Interesting) by Non Sequor on Saturday June 04 2016, @04:46PM

    by Non Sequor (1005) on Saturday June 04 2016, @04:46PM (#355145) Journal

    First, the tool you need is a headless browser. These things exist, and they give you a console- or API-based way to load a page, including anything accomplished by JavaScript, and then allow you to get a DOM tree for the page elements. Some headless browsers even go as far as essentially doing all of the layout for graphical display and will allow you to generate screenshots or query DOM attributes related to display.

    Now, once you're comfortable using a headless browser, the next step for what you want to do is to come up with a heuristic for grabbing an appropriate chunk of article text. I'm thinking a starting point might be to get a list of all of the document nodes with inner text and sort it by word count, possibly weighted by some DOM attribute that's indicative of the final rendered size of the text. From here, I think you can get a heuristic to identify the deepest node in the tree that contains the article body text: traverse the elements of the DOM tree using the parent relation starting from these listed elements, identify common parents, and stop when you have identified a node which is parent to x% of the [display size weighted] inner text.

    Why do we want the deepest one? Because we want to avoid getting text from navigation bars, ads and comments sections. In general, all of those things should be made up of small chunks of text and seeding the search from the largest chunks of text will bias it towards the article section (plus weighting for display size may seal the deal).

    Once you've identified that, either convert it all down to text and let the human editor pare it down, or take all of the text up to and including the first paragraph that is greater than n sentences long (for the sake of getting a decent amount of text from articles which start with a cutesy sequence of short paragraphs).
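
    A very rough, non-headless approximation of that heuristic, using HTML::TreeBuilder; it only sees server-rendered HTML, skips the display-size weighting, and the 70% threshold is a guess:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::TreeBuilder;

    my $url = shift @ARGV or die "usage: $0 URL\n";
    my $res = LWP::UserAgent->new->get($url);
    die "fetch failed: ", $res->status_line, "\n" unless $res->is_success;

    my $tree  = HTML::TreeBuilder->new_from_content( $res->decoded_content );
    my $total = length $tree->as_text;

    # Descend the tree, remembering the deepest element that still holds ~70%
    # of all the text on the page -- that is our guess at the article body.
    my $best  = $tree;
    my @queue = ($tree);
    while ( my $node = shift @queue ) {
        for my $child ( grep { ref $_ } $node->content_list ) {
            if ( length( $child->as_text ) >= 0.7 * $total ) {
                $best = $child;          # a deeper node that still covers most text
                push @queue, $child;
            }
        }
    }

    print $best->as_trimmed_text, "\n";
    $tree->delete;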

    --
    Write your congressman. Tell him he sucks.
    • (Score: 2) by hendrikboom on Saturday June 04 2016, @10:06PM

      by hendrikboom (1125) Subscriber Badge on Saturday June 04 2016, @10:06PM (#355286) Homepage Journal

      I would very much like to know more about headless browsers. I didn't know there were any. Care to provide links or other information about them? Or about readily available components from which to build one's own?

      -- hendrik

      • (Score: 2) by isostatic on Saturday June 04 2016, @10:32PM

        by isostatic (365) on Saturday June 04 2016, @10:32PM (#355299) Journal

        PhantomJS is one: a full-blown WebKit browser with a JavaScript engine, CSS 3 support, etc.

      • (Score: 3, Informative) by Non Sequor on Saturday June 04 2016, @10:33PM

        by Non Sequor (1005) on Saturday June 04 2016, @10:33PM (#355300) Journal

        There's a proliferation of headless browsers right now based around different Javascript engines. Here's a StackOverflow page with a ton of them: http://stackoverflow.com/questions/18539491/headless-browser-and-scraping-solutions. [stackoverflow.com] I've used PhantomJS, primarily as just a JavaScript REPL, but I know it has a lot of features.

        I've also used WWW::Mechanize with Perl and it's easy to use. The only thing is, I think at this point it's a little primitive compared to the other options. Conceptually, I think it's a bit more lynx-like.

        Selenium is probably the most extreme level of engineering invested in a headless browser. It has multiple language frameworks and it works with multiple browsers. I found it harder to get started with.
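
        If it helps anyone getting started, the basic WWW::Mechanize pattern looks roughly like this (no JavaScript execution, so it is indeed lynx-like; the URL is a placeholder):

        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new( autocheck => 1 );   # die on HTTP errors

        $mech->get('https://example.com/some/article');     # placeholder URL
        print $mech->title, "\n";

        # Raw HTML is in ->content; ->links gives every <a> on the page.
        for my $link ( $mech->links ) {
            printf "%s => %s\n", $link->text // '', $link->url_abs;
        }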

        --
        Write your congressman. Tell him he sucks.
  • (Score: -1, Offtopic) by Anonymous Coward on Saturday June 04 2016, @04:48PM

    by Anonymous Coward on Saturday June 04 2016, @04:48PM (#355146)
  • (Score: 2) by JoeMerchant on Saturday June 04 2016, @05:05PM

    by JoeMerchant (3937) on Saturday June 04 2016, @05:05PM (#355152)

    If the page is graphically rendered, you can feed the image to OCR. The problem then is blocking ads and extraneous content... The approach is not without problems, but has the advantage that: if you can read the page, your scraper can read the page.
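
    For the curious, a bare-bones sketch of that pipeline, assuming wkhtmltoimage and tesseract are installed (ad blocking is left as an exercise):

    use strict;
    use warnings;

    my $url = shift @ARGV or die "usage: $0 URL\n";

    # Render the page to a PNG, then let tesseract OCR it back into text.
    system('wkhtmltoimage', '--quiet', $url, '/tmp/page.png') == 0 or die "render failed\n";
    system('tesseract', '/tmp/page.png', '/tmp/page')         == 0 or die "OCR failed\n";

    open my $fh, '<', '/tmp/page.txt' or die $!;    # tesseract appends .txt itself
    print while <$fh>;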

    --
    🌻🌻 [google.com]
  • (Score: 3, Interesting) by Thexalon on Saturday June 04 2016, @05:32PM

    by Thexalon (636) on Saturday June 04 2016, @05:32PM (#355167)

    How about this: A "firehose"-like thing that allows a human volunteer to take that link, read the article, summarize it if it's any good, and turn it into a submission?

    --
    The only thing that stops a bad guy with a compiler is a good guy with a compiler.
    • (Score: 0) by Anonymous Coward on Saturday June 04 2016, @05:47PM

      by Anonymous Coward on Saturday June 04 2016, @05:47PM (#355172)

      If you're bored, why not just read the submissions in the queue?

  • (Score: 3, Insightful) by Username on Saturday June 04 2016, @05:47PM

    by Username (4557) on Saturday June 04 2016, @05:47PM (#355173)

    I haven't used Perl since the year 2000ish, but there should be some kind of inet or wget function or library you can use to get the page. You can fool most javasbullscript websites into HTML mode by using a mobile-browser user agent. If that doesn't work, just delete the submission or website from the scrape list. Fuck em.

    I would do it the template way. Most news sites do not change their templates very often, and if they do it would be a trivial matter of updating the scraping script to look for id="newstitle" instead of class="titlenews". You could even have a web page with a bunch of input boxes whose contents are read from and written to a config file that stores the lookup terms for the script to use, so moderators can update it without having to update the script. If the term isn't found, the script can fall back to other methods to find the content, but you'll likely get a bunch of garbage. I'd make this auto-generated article a submission itself that has to be approved and not automatically shown on the page, just in case it does fuck up.
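
    Roughly what the mobile user-agent trick plus a per-site lookup table might look like; the %site_rules entries, hostnames, and selectors here are invented for the example:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::TreeBuilder;

    # Per-site lookup terms, the sort of thing a config page could read and write.
    my %site_rules = (
        'example.com' => { tag => 'div', attr => { id    => 'newstitle' } },
        'example.org' => { tag => 'div', attr => { class => 'titlenews' } },
    );

    my $url = shift @ARGV or die "usage: $0 URL\n";
    my ($host) = $url =~ m{^https?://(?:www\.)?([^/]+)};

    # Pretend to be a phone so the site serves its lighter, mostly-HTML version.
    my $ua = LWP::UserAgent->new(
        agent => 'Mozilla/5.0 (iPhone; CPU iPhone OS 9_0 like Mac OS X) Mobile/13A344',
    );
    my $res = $ua->get($url);
    die "fetch failed: ", $res->status_line, "\n" unless $res->is_success;

    my $tree = HTML::TreeBuilder->new_from_content( $res->decoded_content );
    my $rule = $site_rules{$host};
    my $node = $rule ? $tree->look_down( _tag => $rule->{tag}, %{ $rule->{attr} } ) : undef;

    print $node ? $node->as_trimmed_text . "\n" : "no rule for $host - fall back to something cruder\n";
    $tree->delete;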

    • (Score: 2) by Username on Saturday June 04 2016, @06:12PM

      by Username (4557) on Saturday June 04 2016, @06:12PM (#355187)

      PS: Also, I'd compare the articles on the scrape list to other articles that were scraped at the same time and link them together. I'd probably take the first 2000 characters of the article, remove common words like "of|the|a|it" etc., and turn all runs of spaces into a single space (char 32). Then load the remaining words into an array using that space as the delimiter, find each word in the other articles' arrays, and use the number of common words to determine relevancy. Or maybe just find some keyword from a keyword list and tag the article with it.

      Something like that.
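
      A quick sketch of that word-overlap idea; the stopword list is made up and nothing is tuned:

      use strict;
      use warnings;

      my %stop = map { $_ => 1 } qw(of the a an it is to in and for on that);

      # Reduce an article to a set of uncommon lowercase words from its first 2000 chars.
      sub word_set {
          my ($text) = @_;
          my %seen;
          for my $w ( split /\s+/, lc substr( $text, 0, 2000 ) ) {
              $w =~ s/[^a-z0-9]//g;
              $seen{$w} = 1 if length $w and not $stop{$w};
          }
          return \%seen;
      }

      # Relevancy = number of uncommon words two articles share.
      sub relevancy {
          my ( $set1, $set2 ) = map { word_set($_) } @_;
          return scalar grep { $set2->{$_} } keys %$set1;
      }

      printf "common words: %d\n", relevancy( "The cat sat on the mat", "A cat on a red mat" );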

  • (Score: 2) by mcgrew on Saturday June 04 2016, @06:07PM

    by mcgrew (701) <publish@mcgrewbooks.com> on Saturday June 04 2016, @06:07PM (#355182) Homepage Journal

    I haven't had the need for a scraper, but coincidentally I ran across this site [htmlgoodies.com] yesterday when I was looking for a solution to a different problem, and it had Fetch Hyperlinked Files using Jsoup [htmlgoodies.com]. I never heard of Jsoup before and have no idea if it would work for you. Good Luck!

    --
    mcgrewbooks.com mcgrew.info nooze.org
  • (Score: 5, Insightful) by Anonymous Coward on Saturday June 04 2016, @06:39PM

    by Anonymous Coward on Saturday June 04 2016, @06:39PM (#355195)

    Don't do it!

    1) If the article isn't good enough for an editor to vet and read, why is it good enough to waste every soylentil's time on?

    2) Scraping JS means soylentils would need to enable JS. Have you /spoken/ to us? When was the last time one of us /wasn't/ running noscript-or-equivalent??

    • (Score: 2) by urza9814 on Tuesday June 07 2016, @11:59PM

      by urza9814 (3954) on Tuesday June 07 2016, @11:59PM (#356662) Journal

      1) If the article isn't good enough for an editor to vet and read, why is it good enough to waste every soylentil's time on?

      Seems that they don't need editors, they need article submissions. That's what this is intended to solve. The articles will still be edited as usual, this will just ensure that the editors have more than just a bare link to start with.

      2) Scraping JS means soylentils would need to enable JS. Have you /spoken/ to us? When was the last time one of us /wasn't/ running noscript-or-equivalent??

      The idea is for the SN servers to run the JavaScript so you don't have to.

  • (Score: 2) by goodie on Saturday June 04 2016, @07:03PM

    by goodie (1877) on Saturday June 04 2016, @07:03PM (#355212) Journal

    Not sure whether that would work for you, but Beautiful Soup (Python) was made for this type of thing, I reckon. I've used it to clean up messy HTML etc. and to retrieve only the text contents from within a webpage, and it was easy to use and very good at it (for my purpose).

    https://www.crummy.com/software/BeautifulSoup/ [crummy.com]

    • (Score: 0) by Anonymous Coward on Saturday June 04 2016, @07:11PM

      by Anonymous Coward on Saturday June 04 2016, @07:11PM (#355214)

      BS works until you hit heavy JS or Flash sites; then, not so much.

      Source: I write scrapers from time to time with BS/Python for my conky info windows.

  • (Score: 2) by AudioGuy on Saturday June 04 2016, @07:04PM

    by AudioGuy (24) on Saturday June 04 2016, @07:04PM (#355213) Journal

    yourshell # lynx -dump http://soylentnews.org/ [soylentnews.org]

    (dumps to your screen plain text and a list of links at bottom)

    It's reasonably nicely formatted for parsing.

    Your grep skills will be improved. :-)

  • (Score: 2) by jdavidb on Saturday June 04 2016, @07:39PM

    by jdavidb (5690) on Saturday June 04 2016, @07:39PM (#355225) Homepage Journal
    The correct way to scrape in Perl is the awesome WWW::Mechanize by the fantastic Andy Lester [github.com], author of ack [beyondgrep.com].
    --
    ⓋⒶ☮✝🕊 Secession is the right of all sentient beings
  • (Score: 2, Informative) by DonkeyChan on Saturday June 04 2016, @08:18PM

    by DonkeyChan (5551) on Saturday June 04 2016, @08:18PM (#355246)

    http://www.seleniumhq.org/ [seleniumhq.org]
    I use Selenium for all my unit tests and it's been far easier to implement than WWW::Mechanize or other solutions.

  • (Score: 0) by Anonymous Coward on Saturday June 04 2016, @08:22PM

    by Anonymous Coward on Saturday June 04 2016, @08:22PM (#355248)

    firefox plugin: submit for consideration ... abuse ... ooops

  • (Score: 3, Informative) by number6 on Saturday June 04 2016, @09:43PM

    by number6 (1831) on Saturday June 04 2016, @09:43PM (#355282) Journal

    Web::Scraper
    http://search.cpan.org/dist/Web-Scraper/ [cpan.org]
    Perl module, Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions, by Tatsuhiko Miyagawa.
    * A blog article............http://perlgerl.wordpress.com/2011/02/27/219/
    * A comment...............Web::Scraper is so effin awesome. You just yank up some LWP from disk,
        add some scraper to it and whambo! you have yourself some local data that the web universe
        used to call its own. Follow that author on CPAN; he's a genius.

     
    How to fetch urls? I want to build a Perl web scraper like Python's urllib
    https://www.reddit.com/r/perl/comments/3oiwqn/how_to_fetch_urls_like_in_pythons_urllib/ [reddit.com]

     
    Web Scraping with Perl & PhantomJS (headless WebKit browser)
    article, Feb 2013, by Rob Hammond (Perl programmer)
    http://blogs.perl.org/users/robhammond/2013/02/web-scraping-with-perl-phantomjs.html [perl.org]

     
    Perl Proxy Scraper
    https://phl4nk.wordpress.com/2015/04/11/perl-proxy-scraper/ [wordpress.com]
    * A comment...............Neat! I toyed around with a one-liner that does about the same thing for standard output:
        perl -Mojo -E 'say g($_)->body =~ /((?:\d{1,3}\.?){4}:\d{1,4})/g for @ARGV' $url1 $url2 ...

     
    how-to-develop-a-good-scraper-on-perl - Google Search [google.com]

     

  • (Score: 0) by Anonymous Coward on Saturday June 04 2016, @10:29PM

    by Anonymous Coward on Saturday June 04 2016, @10:29PM (#355298)

    BlackWidow
    by http://SoftByteLabs.com/ [softbytelabs.com]

    Published June 11, 2004

    This is a multi-function Internet tool. It is an offline browser, Web site scanner, a site mapping tool, a site ripper, a site mirroring tool and an FTP scanner. Use it to scan a site and create a complete profile of the site's structure, files, external links and link errors. Then use it to download the web site with its structure and files intact, to use as a site mirror or to be converted by BlackWidow into a locally linked site for off-line browsing and long-term reference. Or use it to scan for and download any selection of files in part of a site or in a group of sites.
    BlackWidow will scan HTTP sites, SSL sites and FTP sites. You can access password-protected sites, use threads, pull links from JavaScript and JavaScript files, and resume broken downloads. View, edit and print the structure of a Web site, write your own plug-ins and automatically load your plug-ins. It will also scan Adobe Acrobat files for links.

    https://archive.org/details/tucows_193802_BlackWidow [archive.org]

  • (Score: 4, Informative) by isostatic on Saturday June 04 2016, @10:37PM

    by isostatic (365) on Saturday June 04 2016, @10:37PM (#355301) Journal

    Good evening. Today is Good Friday. There is no news

    If there's nothing to report, don't report it. Don't go for an 18-minute news program consisting of 15 minutes of "old sportsman has died" and a couple of minutes of "politicians lie", as happened with the BBC this evening.

  • (Score: 0) by Anonymous Coward on Sunday June 05 2016, @12:33AM

    by Anonymous Coward on Sunday June 05 2016, @12:33AM (#355331)

    Does the site have an RSS feed? That's a whole lot easier to parse.

  • (Score: 3, Informative) by Hairyfeet on Sunday June 05 2016, @02:55AM

    by Hairyfeet (75) <bassbeast1968NO@SPAMgmail.com> on Sunday June 05 2016, @02:55AM (#355378) Journal

    But there is an easy way to fix the "queue runs dry" issue, which is where I've gotten every article I've posted here: Daily Rotation [dailyrotation.com], which is nothing but science and tech, VERY nerdy and right up this site's alley. As just a couple of examples, from just grabbing the latest headlines: a missing features installer for XP [betanews.com] from the guy that made the MFI for Win 10, FBI says TOR child porn exploit wasn't malware [arstechnica.com], and Japan's battleship island [cnet.com] with some great pictures.

    So if the site starts to run dry? Just head over to Rotation and grab the headlines; it's all the nerdy kind of stuff the old site USED to cover.

    --
    ACs are never seen so don't bother. Always ready to show SJWs for the racists they are.
  • (Score: 2) by Fnord666 on Monday June 06 2016, @01:26AM

    by Fnord666 (652) on Monday June 06 2016, @01:26AM (#355658) Homepage
    Have you considered contacting janrinok about the StoryBot software that he uses to submit stories?
  • (Score: 0) by Anonymous Coward on Monday June 06 2016, @03:36PM

    by Anonymous Coward on Monday June 06 2016, @03:36PM (#355937)

    ...use a headless browser like Phantom JS (http://phantomjs.org/ [phantomjs.org]) to pull the desired content.

  • (Score: 0) by Anonymous Coward on Monday June 06 2016, @10:20PM

    by Anonymous Coward on Monday June 06 2016, @10:20PM (#356151)

    Pages that require JavaScript are a pain, and I can't answer that problem.

    But if you can reach the point of getting the page content into an HTML file, I can say that I had good success using the following to scrape content: I'd use tidy (https://www.google.com/#q=tidy+xhtml) to convert the HTML into XML-parsable XHTML. Then I'd use XSLT to mine the content I wanted out of the XHTML. The advantage to this was that the XSLT was fully programmable without having to recompile any code, so when sites had minor page layout changes, the tidy+XSLT usually continued working without noticing.
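
    For anyone who wants to try that route from Perl, a small sketch; it assumes the tidy binary plus the XML::LibXML and XML::LibXSLT modules, and 'extract.xsl' stands in for whatever site-specific stylesheet you write:

    use strict;
    use warnings;
    use XML::LibXML;
    use XML::LibXSLT;

    my $html = shift @ARGV or die "usage: $0 page.html\n";

    # Let tidy turn tag soup into well-formed XHTML that an XML parser will accept.
    system("tidy -q -asxhtml --numeric-entities yes -o /tmp/page.xhtml $html");

    my $doc   = XML::LibXML->load_xml( location => '/tmp/page.xhtml' );
    my $style = XML::LibXSLT->new->parse_stylesheet(
        XML::LibXML->load_xml( location => 'extract.xsl' )   # your site-specific mining rules
    );

    # The stylesheet does the mining; a layout tweak means editing XSL, not recompiling code.
    print $style->output_as_bytes( $style->transform($doc) );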