Meta
posted by takyon on Saturday June 04 2016, @03:30PM
from the just-a-thought dept.

Not infrequently, we receive a story submission to SoylentNews that consists of a single URL. Or, a story that contains but a single sentence or paragraph along with the URL. During weekends and holidays, the story submission queue tends to run dry. We have an IRC channel (#rss-bot) that gathers RSS links from around the web. Hmm.

It would be really handy if there were an automated way to "scrape" the contents of that page. In some cases, a simple redirect of the output of a text-based browser like Lynx would do the trick. Unfortunately, all too many sites subscribe to the idea that a web page needs to pull in Javascript and libraries from a myriad of other sites. Failing to do so displays a blank page.

There must be a way to do it — search engines like Google and Bing must extract the page text in order to index it. It would be best to have a general-purpose solution; having a custom template for each site is time-consuming to create and maintain (think if the site changes its layout). Our site is powered by Perl, so that would be the obvious preference.

So, fellow Soylentils, what tools and/or techniques have you used? What has worked for you?

Maybe I'm approaching this the wrong way? When all you have is a hammer... what am I missing here? Is there another approach?


Original Submission

  • (Score: 2) by Runaway1956 on Saturday June 04 2016, @03:44PM

    by Runaway1956 (2926) Subscriber Badge on Saturday June 04 2016, @03:44PM (#355116) Journal

    How does Calibre fetch news? http://manual.calibre-ebook.com/news.html [calibre-ebook.com] I have no idea if Calibre is in a speaking relationship with Perl though.

    • (Score: 4, Insightful) by frojack on Saturday June 04 2016, @04:38PM

      by frojack (1554) on Saturday June 04 2016, @04:38PM (#355143) Journal

      I ran Calibre's fetch-news app for quite a while, but transferring the news to my e-reader was never worth the effort.
      It is rather indiscriminate and is designed to scrape the entire site rather than just an article, but the scripts and source are all available.

      But here we are, with the IRC-channel tail wagging the website dog, something I warned about when the IRC channel was first discussed.
      If people think so little of a story that a one-liner on IRC is all they can muster, maybe we don't need that story.

      --
      No, you are mistaken. I've always had this sig.
  • (Score: 5, Insightful) by frojack on Saturday June 04 2016, @03:46PM

    by frojack (1554) on Saturday June 04 2016, @03:46PM (#355118) Journal

    Better to get people to submit stories than publicly beg for an automated plagiarism machine.

    If people don't submit on weekends, it's probably because people don't read the site on weekends, and wouldn't notice it not being updated as often.

    --
    No, you are mistaken. I've always had this sig.
    • (Score: 0) by Anonymous Coward on Saturday June 04 2016, @05:35PM

      by Anonymous Coward on Saturday June 04 2016, @05:35PM (#355168)

      Dats rite! Let dem trolls comment on teh submitted stories til som fool submit mor!

    • (Score: 1) by idetuxs on Saturday June 04 2016, @06:40PM

      by idetuxs (2990) on Saturday June 04 2016, @06:40PM (#355196)

      Better to get people to submit stories than publicly beg for an automated plagiarism machine.

      How is that different from doing it manually?

      • (Score: 3, Informative) by frojack on Saturday June 04 2016, @08:00PM

        by frojack (1554) on Saturday June 04 2016, @08:00PM (#355229) Journal

        No original content.
        Scraping of entire articles, not just a fair-use paragraph.
        No multi-source stories.

        --
        No, you are mistaken. I've always had this sig.
        • (Score: 2) by Runaway1956 on Sunday June 05 2016, @02:21AM

          by Runaway1956 (2926) Subscriber Badge on Sunday June 05 2016, @02:21AM (#355373) Journal

          I've done it the "hard way", and I've done it with the IRC submission bot.

          I'll admit, I do feel just a little guilty doing it the "easy way". ~submit url - that's so fast and easy, any braindead idiot can do it. But, sometimes, I'm getting ready for work, stumble over something I deem worthy of submission, and I simply don't have TIME.

          Besides which - consensus seems to be that the submitter shouldn't add his own remarks, opinions, or observations, so the only advantage of manual submissions is that you "can" list additional sources. (I do that almost automatically when the source is Faux Noise, LOL)

          • (Score: 3, Insightful) by frojack on Sunday June 05 2016, @07:36AM

            by frojack (1554) on Sunday June 05 2016, @07:36AM (#355431) Journal

            you "can" list additional sources.

            Yeah, I noticed butthurt has made that his personal signature.
            He posts as many sources as he can, but picks the most twisted version for his write-up.

            I actually have no problem with submitters including comments and their own analysis. As long as they make it clear that is what they are doing.

            --
            No, you are mistaken. I've always had this sig.
    • (Score: 2) by martyb on Sunday June 05 2016, @01:15PM

      by martyb (76) Subscriber Badge on Sunday June 05 2016, @01:15PM (#355502) Journal

      Better to get people to submit stories than publicly beg for an automated plagiarism machine.

      I see from your history that you regularly submit stories (about 60 over the past year!), and even more frequently, submit comments — please accept my genuine thanks for your active participation!

      As for "automated plagiarism machine", I sense a misunderstanding. This was an effort to obtain, within the initial submission, a form of a linked story that could be used as a basis for creating a story for SoylentNews. In no place was it stated that we would publish that as-is. Editors, well, edit. But, it is much easier to edit when the whole story is presented, with a consistent, filtered input. Select-all, copy, paste loses both links (e.g. "Click here to see our earlier coverage" and "here" does not come through as a link) and formatting (e.g. italics that identify titles of publications, like The New York Times.) Maybe there is a better, easier way?

      What is difficult about submitting stories? Say, you see an interesting story on the web and think it might be interesting to the community. Why not submit it? Too much effort involved? How can we make it easier? That is the nature of the problem that this story was attempting to explore... an automated mechanism to make things easier.

      If people don't submit on weekends, its probably because people don't read the site on weekends, and wouldn't notice it not being updated as often.

      Non Sequitur? In my experience, one does not necessarily imply the other. Plenty of people read the site without submitting on both weekdays AND weekends... I'd hazard a guess that the majority of the community falls into that category.

      It is true, however, that the number of submissions for the weekend does take a nosedive. Could just be a matter of, say, half as many people on the site results in half as many submissions (pulling numbers out of my posterior).

      On the other hand, there is no shortage of people reading SoylentNews on weekends. Ignoring ACs, this story alone has been viewed in excess of 1000 times. As a rule of thumb, we have found that hits on our load balancer suggest that actual story readership is on the order of ten times that amount.

      The more general problem, as I see it, is what hinders people from making submissions in the first place? What is difficult about the process? What tools are available to make it easier? This story pursued just one possible approach. I repeat the question posed in the original story:

      Maybe I'm approaching this the wrong way? When all you have is a hammer... what am I missing here? Is there another approach?

      I've seen many helpful responses already and will be looking into them. Please accept my humble thanks for all of the suggestions! I sense that I have identified a possible mechanism for a solution to a part of the problem, and should be looking at it from a higher level of abstraction. Maybe what would be more useful is a browser plugin that one could activate and which would guide the submission process from within a web-page-of-interest? Please keep the ideas coming!

      --
      Wit is intellect, dancing.
      • (Score: 2) by frojack on Sunday June 05 2016, @05:08PM

        by frojack (1554) on Sunday June 05 2016, @05:08PM (#355540) Journal

        I see a fairly large percentage of stories posted by the known bots that consist solely of copy-pasted paragraphs. Not saying that's totally bad, as long as the meat of the article is actually captured in the copy/paste. Too often these bot-submitted articles get posted one paragraph or one sentence too short - forcing TFA chasing by every reader.

        (I argue with myself as to whether the idea is to get people to follow links to TFA, or merely have TFA as backup and attribution purposes. Personally I like to excerpt the story, dig out the nuggets so that readers don't really have to read TFA, because I suspect most won't. Rarely will I set up a story to require TFA reading.)

        I feel no compunction to list a ton of different source links, because often these are redundant - if not word for word copies off of the wire services.

        And you would be surprised how many times I start a submission, but upon digging through the TFA I find it's a huge load of tripe, and just abandon it. If my bullshit bell is ringing, or the source has less than the best reputation, I usually back away from the keyboard so nobody gets hurt. Do I dare submit from RT or Sputnik? Sure, some of their articles sound plausible, but the ice is thin and the current strong. Sometimes entire stories are just flamebait.

        --
        No, you are mistaken. I've always had this sig.
        • (Score: 2) by martyb on Monday June 06 2016, @03:40AM

          by martyb (76) Subscriber Badge on Monday June 06 2016, @03:40AM (#355711) Journal

          Thanks for the reasoned and civil response to my reply. I'll continue in that vein and reply only to that which you replied.

          I see a fairly large percentage of stories posted by the known bots that consist solely of copy-pasted paragraphs. Not saying that's totally bad, as long as the meat of the article is actually captured in the copy/paste. Too often these bot-submitted articles get posted one paragraph or one sentence too short - forcing TFA chasing by every reader.

          For the record, to my knowledge, there are only three: "Arthur T Knackerbracket", "exec", and "MrPlow". Based on the scraping tech taken from "Arthur", a new one is in the works which is called "x" at the moment.

          That is one of the problems I regularly face as an editor. If the submission *seems* to hold together on its own, I succumb to efficiency instead of exhaustive examination, and generally accept it without too much rework, but it certainly depends on the submitter. For example, there is one regular submitter whose stories generally drop italics. Oh, a story from 'foo'? Need to check on x, y, and z.

          I'm gaining a sense of which submitters are prone to submitting a too-short story. Then the challenge comes in finding out what they DID submit, and what was omitted, and putting it back together. Strictly from an editing standpoint, I'd take the too-much over the too-little submitted, with the caveat that, if it is substantially the whole article, the submitter makes that known.

          That was, in part, the motivation for the original story; I find it easier to cut than to extend.

          (I argue with myself as to whether the idea is to get people to follow links to TFA, or merely have TFA as backup and attribution purposes. Personally I like to excerpt the story, dig out the nuggets so that readers don't really have to read TFA, because I suspect most won't. Rarely will I set up a story to require TFA reading.)

          I argue with myself, too. I hope you have more success than I do! ;-) In some cases, it is exceedingly difficult to condense the story down any further than the original story. In those cases, I just muddle through as best I can and leave it to the curious to read the rest of the story. In other cases, I've most likely missed a salient paragraph (or two). It happens. I'd estimate about half of my editing is done after 10pm, and that is usually after a full day at work.

          I generally try to take a light touch in my editing. In short, if it passes a "sniff test", i.e. the story holds together for the most part, is not too terribly biased, and covers something that I sense would be of interest to the community, it'll probably make it out pretty much as received. The others, well, time and energy permitting, I'll give a go at those, too. I just make sure to cinch up my belt, take a deep breath, and dive in.

          It is generally not my intention to try and force reading of TFA, though I have no doubts it has happened on more than a few occasions.

          I feel no compunction to list a ton of different source links, because often these are redundant - if not word for word copies off of the wire services.

          I have a similar view on this. Include the links to whatever was provided in the submission. If there is another "angle" that was published elsewhere, feel free to mention it, but it would be helpful to include why it was deemed interesting beyond what was already submitted.

          And you would be surprised how many times I start a submission, but upon digging through the TFA I find it's a huge load of tripe, and just abandon it. If my bullshit bell is ringing, or the source has less than the best reputation, I usually back away from the keyboard so nobody gets hurt. Do I dare submit from RT or Sputnik? Sure, some of their articles sound plausible, but the ice is thin and the current strong. Sometimes entire stories are just flamebait.

          Why do you think I'd be surprised? That has happened to me countless times on my submissions, too. Further, I've encountered the same thing when editing a story. Spend a bunch of time formatting it, reviewing the sources, and then find something that just doesn't jibe.

          In some cases, there are enough stories in the submission queue that I can just ignore this one and try a different story. In other cases, the empty story is yelling at me. Then, as much work as it will be, I find I need to dive in and just try to make a silk purse out of half a sow's ear.

          Going full circle, two of the bots submit little more than a URL, a title, and maybe a couple/few sentences at best. I'd sooner see the whole story be submitted, along with a caution that it includes the entire story text. That would be a flag to me that I cannot run the entire submission as a story and that I need to pare it down before pushing it out to the story queue. And that was a big part of the motivation for the original story's request -- trying to avoid a too-succinct story submission.

          Okay, it is well nigh midnight and I am struggling to continue. I hope this response better explains my motivation and what I envisioned as a possible path that would improve things.

          Good talking with you; I look forward to your response. (I'm terribly busy the next few days, so I apologize to you in advance if I do not respond to your reply immediately or even in short order.)

  • (Score: 0) by Anonymous Coward on Saturday June 04 2016, @03:50PM

    by Anonymous Coward on Saturday June 04 2016, @03:50PM (#355122)

    Unfortunately, all too many sites subscribe to the idea that a web page needs to pull in Javascript and libraries from a myriad of other sites. Failing to do so displays a blank page.

      I use NoScript set to block JavaScript by default, and when I enable JavaScript I always use the "temporary for this page" setting. So if a site needs JavaScript I have to re-enable it manually each time I go there, which keeps me very aware of which sites need it.

      My experience is that most news sites work better without JavaScript. It gets past a lot of paywalls too. The formatting can be ugly, but Firefox's "reader mode" fixes the ugliness nearly every time (and reader mode does not phone home or do anything else sneaky).

    • (Score: -1, Offtopic) by Anonymous Coward on Saturday June 04 2016, @05:13PM

      by Anonymous Coward on Saturday June 04 2016, @05:13PM (#355159)

      I use noscript

      +1000 Trendy Pseudo-Geek

      Congratulations for being a follower!

      • (Score: -1, Troll) by Anonymous Coward on Saturday June 04 2016, @05:25PM

        by Anonymous Coward on Saturday June 04 2016, @05:25PM (#355164)

        Shattap ya goddamn nigga-jew-boy.

        • (Score: -1, Redundant) by Anonymous Coward on Saturday June 04 2016, @05:41PM

          by Anonymous Coward on Saturday June 04 2016, @05:41PM (#355170)

          Yeah that's right, and my dick's so big that when my black mammy had me circumcised she saved the skin and now I carry my foreskin around with me on a chain. I'ma beat yer ass with my massive foreskin, white goy!

      • (Score: 2) by Runaway1956 on Sunday June 05 2016, @02:27AM

        by Runaway1956 (2926) Subscriber Badge on Sunday June 05 2016, @02:27AM (#355375) Journal

        Trendy has nothing to do with anything. I can't speak for anyone else, but I have limited bandwidth, shared between two people all the time, and as many as 8 people sometimes. Advertising can and will hog 75 to 95% of my bandwidth. How in hell can I browse the web if advertising is consuming the lion's share of my 1 1/2 to 2 MB of bandwidth? I block everything - web fonts, scripts, cross-site scripting, known ad servers - then it is possible to share that limited bandwidth among however many of us are online.

        That doesn't even begin to describe how much I detest the practices of commercial interests on the internet. If I had unlimited bandwidth, I would still block all the shite that I block now.

        Trendy? Hardly. The masses are consuming the commercial shite just as fast as the corporations can shovel it to them. Witness Facebook and the like.

  • (Score: 5, Insightful) by bradley13 on Saturday June 04 2016, @03:51PM

    by bradley13 (3053) on Saturday June 04 2016, @03:51PM (#355123) Homepage Journal

    ...many sites subscribe to the idea that a web page needs to pull in Javascript and libraries from a myriad of other sites. Failing to do so displays a blank page

    Sites like this deserve to die in obscurity. If the basic content isn't present without JavaScript, look elsewhere. Has the advantage of making your life simpler, as well...

    --
    Everyone is somebody else's weirdo.
    • (Score: 2, Insightful) by Anonymous Coward on Saturday June 04 2016, @04:25PM

      by Anonymous Coward on Saturday June 04 2016, @04:25PM (#355133)

      I agree with this. If a website is so badly written it can't even display text without client-side scripting we're better off without that website.

  • (Score: 2, Interesting) by isj on Saturday June 04 2016, @03:56PM

    by isj (5249) on Saturday June 04 2016, @03:56PM (#355125) Homepage

    As far as I know, Google actually runs the page in a small VM (probably V8 with a Firefox/Chrome-like environment), and when the page content has settled they index that. They earlier proposed a scheme to crawl AJAX pages: https://developers.google.com/webmasters/ajax-crawling/docs/learn-more#an-agreement-between-crawler-and-server [google.com] . However, it gets a bit more weird: https://www.searchenginejournal.com/google-backs-down-from-proposal-to-make-ajax-pages-crawlable/143291/ [searchenginejournal.com] where they instead recommend "progressive enhancement".

    My hypothesis is that Google made their VM emulation good enough, and that by deprecating the crawlable scheme, clueless webmasters will force all other crawlers to waste CPU and power executing JavaScript just to get to the information. A waste of CPU cycles.

    • (Score: 1, Informative) by Anonymous Coward on Saturday June 04 2016, @04:18PM

      by Anonymous Coward on Saturday June 04 2016, @04:18PM (#355131)

      Same as YouTube deprecating the anonymous API so they can track people: it forces you to scrape the pages and download hundreds of kB, also wasting CPU parsing them, just to get the list of the latest videos for each channel.

  • (Score: 2) by opinionated_science on Saturday June 04 2016, @04:12PM

    by opinionated_science (4031) on Saturday June 04 2016, @04:12PM (#355127)

    There's a reader button in firefox - cleans up a lot for reading purposes.

    Maybe this helps?

  • (Score: 3, Interesting) by canopic jug on Saturday June 04 2016, @04:18PM

    by canopic jug (3949) Subscriber Badge on Saturday June 04 2016, @04:18PM (#355130) Journal

    You probably know about LWP to fetch the pages. To extract text using Perl, I'd recommend HTML::TreeBuilder or HTML::TreeBuilder::XPath. I've recently used both with wget to extract thousands of pages from a defunct CMS and restructure the output into static HTML designed for human maintenance, converting some of the more obnoxious attempts at formatting to CSS and many chunks to SSI. It will work easily for screen scraping and will be easy to update when the target changes its layout. You might want to pipe the fetched pages through Tidy first to make sure they are valid HTML or at least well-formed XHTML.

    There is a corresponding XML module, XML::TreeBuilder, that functions in much the same way if you need to process XML feeds.
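
    For illustration, a minimal sketch along those lines; the XPath expression, div id, and user-agent string are placeholders to adapt per page:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::TreeBuilder::XPath;

    my $url = shift @ARGV or die "usage: $0 URL\n";

    my $ua  = LWP::UserAgent->new( timeout => 30, agent => 'SN-scraper/0.1' );
    my $res = $ua->get($url);
    die "fetch failed: ", $res->status_line, "\n" unless $res->is_success;

    # Parse the fetched (ideally tidied) HTML into a tree we can query with XPath.
    my $tree = HTML::TreeBuilder::XPath->new_from_content( $res->decoded_content );

    # Placeholder XPath: the text of every <p> inside the main article div.
    for my $p ( $tree->findnodes('//div[@id="article"]//p') ) {
        print $p->as_trimmed_text, "\n\n";
    }

    $tree->delete;    # free the parse tree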

    --
    Money is not free speech. Elections should not be auctions.
    • (Score: 2) by fliptop on Saturday June 04 2016, @07:30PM

      by fliptop (1666) on Saturday June 04 2016, @07:30PM (#355221) Journal

      To extract text using perl, I'd recommend HTML::TreeBuilder or HTML::TreeBuilder::XPath

      HTML::SimpleParse works pretty well too. Sites tend to have content in its own uniquely named div, so if you walk the $p->tree and find the div(s) you can easily retrieve what you're looking for.

      --
      Our Constitution was made only for a moral and religious people. It is wholly inadequate to the government of any other.
      • (Score: 2) by canopic jug on Saturday June 04 2016, @08:28PM

        by canopic jug (3949) Subscriber Badge on Saturday June 04 2016, @08:28PM (#355256) Journal

        HTML::SimpleParse works pretty well too. Sites tend to have content in its own uniquely named div, so if you walk the $p->tree and find the div(s) you can easily retrieve what you're looking for.

        Yes, I looked at HTML::SimpleParse and HTML::TokeParser first. For the task I had, I quickly needed something more. With HTML::TreeBuilder you can make fairly complex selections and even replacements.

        But reading all of the summary again, I see that the problem is javascript, not so much the HTML parsing. I generally don't waste my time on sites that are so heavy with javascript that they won't work with it turned off. Those kinds of sites probably should be encouraged to die. If SN wants to keep those kinds of sites alive then maybe JavaScript::SpiderMonkey and / or WWW::Mechanize::Firefox could work to parse the garbage pages before processing. However, those scripts and any dependent applications really need to be severely chrooted and locked down.
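
        If anyone wants to experiment, a minimal sketch of the WWW::Mechanize::Firefox route, assuming a running Firefox with the MozRepl extension (and locked down as noted above):

        use strict;
        use warnings;
        use WWW::Mechanize::Firefox;    # drives a real Firefox via the MozRepl extension

        my $url  = shift @ARGV or die "usage: $0 URL\n";
        my $mech = WWW::Mechanize::Firefox->new();

        $mech->get($url);        # Firefox executes the page's JavaScript for us
        print $mech->content;    # the rendered DOM, ready for HTML::TreeBuilder and friends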

        --
        Money is not free speech. Elections should not be auctions.
  • (Score: 4, Interesting) by Beige on Saturday June 04 2016, @04:29PM

    by Beige (3989) on Saturday June 04 2016, @04:29PM (#355137) Homepage

    Render the page through a command-line renderer like wkhtmltopdf ( http://wkhtmltopdf.org/ [wkhtmltopdf.org] ) and then convert the PDF to text. It should then be pretty easy to write a Perl (etc.) function which keeps only paragraphs of 60+ words (i.e. skips any menus, links, etc.).
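
    Something like this, perhaps; a rough sketch assuming wkhtmltopdf and poppler's pdftotext are installed, with an arbitrary 60-word cutoff and temp paths:

    use strict;
    use warnings;

    my $url = shift @ARGV or die "usage: $0 URL\n";

    # Render the page (JavaScript included) to PDF, then flatten it to plain text.
    system('wkhtmltopdf', '--quiet', $url, '/tmp/page.pdf') == 0 or die "wkhtmltopdf failed\n";
    system('pdftotext', '/tmp/page.pdf', '/tmp/page.txt')   == 0 or die "pdftotext failed\n";

    # Keep only paragraphs of 60+ words; menus, captions and link lists fall away.
    open my $fh, '<', '/tmp/page.txt' or die $!;
    local $/ = '';                      # paragraph mode: read blank-line-separated chunks
    while ( my $para = <$fh> ) {
        my @words = split /\s+/, $para;
        print $para, "\n" if @words >= 60;
    }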

    • (Score: 2) by fishybell on Saturday June 04 2016, @05:24PM

      by fishybell (3156) on Saturday June 04 2016, @05:24PM (#355163)

      +1 for this. I used it for a project at work where it was easier to render something with html/css than with Tcl/Tk [www.tcl.tk] (what the app requiring the rendering was written in). I ran into the problem of platform compatibility with our version of Linux, but was using an older version of the software from its days on Google Code.

  • (Score: 3, Interesting) by Non Sequor on Saturday June 04 2016, @04:46PM

    by Non Sequor (1005) on Saturday June 04 2016, @04:46PM (#355145) Journal

    First, the tool you need is a headless browser. These things exist, and they give you a console- or API-based way to load a page, including anything accomplished by JavaScript, and then allow you to get a DOM tree for the page elements. Some headless browsers even go as far as essentially doing all of the layout for graphical display and will allow you to generate screenshots or query DOM attributes related to display.

    Now, once you're comfortable using a headless browser, the next step for what you want to do is to come up with a heuristic for grabbing an appropriate chunk of article text. I'm thinking a starting point might be to get a list of all of the document nodes with inner text and sort it by word count, possibly weighted by some DOM attribute that's indicative of the final rendered size of the text. From here, I think you can get a heuristic to identify the deepest node in the tree that contains the article body text: traverse the elements of the DOM tree using the parent relation starting from these listed elements, identify common parents, and stop when you have identified a node which is parent to x% of the [display size weighted] inner text.

    Why do we want the deepest one? Because we want to avoid getting text from navigation bars, ads and comments sections. In general, all of those things should be made up of small chunks of text and seeding the search from the largest chunks of text will bias it towards the article section (plus weighting for display size may seal the deal).

    Once you've identified that, either convert it all down to text and let the human editor pare it down, or take all of the text up to and including the first paragraph that is greater than n sentences long (for the sake of getting a decent amount of text from articles which start with a cutesy sequence of short paragraphs).
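
    A very rough, non-headless approximation of that heuristic, using HTML::TreeBuilder; it only sees server-rendered HTML, skips the display-size weighting, and the 70% threshold is a guess:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::TreeBuilder;

    my $url = shift @ARGV or die "usage: $0 URL\n";
    my $res = LWP::UserAgent->new->get($url);
    die "fetch failed: ", $res->status_line, "\n" unless $res->is_success;

    my $tree  = HTML::TreeBuilder->new_from_content( $res->decoded_content );
    my $total = length $tree->as_text;

    # Descend the tree, remembering the deepest element that still holds ~70%
    # of all the text on the page -- that is our guess at the article body.
    my $best  = $tree;
    my @queue = ($tree);
    while ( my $node = shift @queue ) {
        for my $child ( grep { ref $_ } $node->content_list ) {
            if ( length( $child->as_text ) >= 0.7 * $total ) {
                $best = $child;          # a deeper node that still covers most text
                push @queue, $child;
            }
        }
    }

    print $best->as_trimmed_text, "\n";
    $tree->delete;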

    --
    Write your congressman. Tell him he sucks.
    • (Score: 2) by hendrikboom on Saturday June 04 2016, @10:06PM

      by hendrikboom (1125) Subscriber Badge on Saturday June 04 2016, @10:06PM (#355286) Homepage Journal

      I would very much like to know more about headless browsers. I didn't know there were any. Care to provide links or other information about them? Or about readily available components from which to build one's own?

      -- hendrik

      • (Score: 2) by isostatic on Saturday June 04 2016, @10:32PM

        by isostatic (365) on Saturday June 04 2016, @10:32PM (#355299) Journal

        PhantomJS is one: a full-blown WebKit browser with a JavaScript engine, CSS 3 support, etc.

      • (Score: 3, Informative) by Non Sequor on Saturday June 04 2016, @10:33PM

        by Non Sequor (1005) on Saturday June 04 2016, @10:33PM (#355300) Journal

        There's a proliferation of headless browsers right now based around different Javascript engines. Here's a StackOverflow page with a ton of them: http://stackoverflow.com/questions/18539491/headless-browser-and-scraping-solutions. [stackoverflow.com] I've used PhantomJS, primarily as just a JavaScript REPL, but I know it has a lot of features.

        I've also used WWW::Mechanize with Perl and it's easy to use. The only thing is, I think at this point it's a little primitive compared to the other options. Conceptually, I think it's a bit more lynx-like.

        Selenium is probably the most extreme level of engineering invested in a headless browser. It has multiple language frameworks and it works with multiple browsers. I found it harder to get started with.
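
        If it helps anyone getting started, the basic WWW::Mechanize pattern looks roughly like this (no JavaScript execution, so it is indeed lynx-like; the URL is a placeholder):

        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new( autocheck => 1 );   # die on HTTP errors

        $mech->get('https://example.com/some/article');     # placeholder URL
        print $mech->title, "\n";

        # Raw HTML is in ->content; ->links gives every <a> on the page.
        for my $link ( $mech->links ) {
            printf "%s => %s\n", $link->text // '', $link->url_abs;
        }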

        --
        Write your congressman. Tell him he sucks.
  • (Score: -1, Offtopic) by Anonymous Coward on Saturday June 04 2016, @04:48PM

    by Anonymous Coward on Saturday June 04 2016, @04:48PM (#355146)
  • (Score: 2) by JoeMerchant on Saturday June 04 2016, @05:05PM

    by JoeMerchant (3937) on Saturday June 04 2016, @05:05PM (#355152)

    If the page is graphically rendered, you can feed the image to OCR. The problem then is blocking ads and extraneous content... The approach is not without problems, but has the advantage that: if you can read the page, your scraper can read the page.
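
    For the curious, a bare-bones sketch of that pipeline, assuming wkhtmltoimage and tesseract are installed (ad blocking is left as an exercise):

    use strict;
    use warnings;

    my $url = shift @ARGV or die "usage: $0 URL\n";

    # Render the page to a PNG, then let tesseract OCR it back into text.
    system('wkhtmltoimage', '--quiet', $url, '/tmp/page.png') == 0 or die "render failed\n";
    system('tesseract', '/tmp/page.png', '/tmp/page')         == 0 or die "OCR failed\n";

    open my $fh, '<', '/tmp/page.txt' or die $!;    # tesseract appends .txt itself
    print while <$fh>;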

    --
    🌻🌻 [google.com]
  • (Score: 3, Interesting) by Thexalon on Saturday June 04 2016, @05:32PM

    by Thexalon (636) on Saturday June 04 2016, @05:32PM (#355167)

    How about this: A "firehose"-like thing that allows a human volunteer to take that link, read the article, summarize it if it's any good, and turn it into a submission?

    --
    The only thing that stops a bad guy with a compiler is a good guy with a compiler.
    • (Score: 0) by Anonymous Coward on Saturday June 04 2016, @05:47PM

      by Anonymous Coward on Saturday June 04 2016, @05:47PM (#355172)

      If you're bored, why not just read the submissions in the queue?

  • (Score: 3, Insightful) by Username on Saturday June 04 2016, @05:47PM

    by Username (4557) on Saturday June 04 2016, @05:47PM (#355173)

    I haven't used Perl since the year 2000ish, but there should be some kind of inet or wget function or library you can use to get the page. You can fool most javasbullscript websites into HTML mode by using a mobile-browser user agent. If that doesn't work, just delete the submission or website from the scrape list. Fuck em.

    I would do it the template way. Most news sites do not change their templates very often, and if they do it would be a trivial matter of updating the scraping script to look for id="newstitle" instead of class="titlenews". You could even have a web page with a bunch of input boxes whose contents are read from and written to a config file that stores the lookup terms for the script to use, so moderators can update it without having to update the script. If the term isn't found, the script can fall back to other methods to find the content, but you'll likely get a bunch of garbage. I'd make this auto-generated article a submission itself that has to be approved and not automatically shown on the page, just in case it does fuck up.
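
    Roughly what the mobile user-agent trick plus a per-site lookup table might look like; the %site_rules entries, hostnames, and selectors here are invented for the example:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::TreeBuilder;

    # Per-site lookup terms, the sort of thing a config page could read and write.
    my %site_rules = (
        'example.com' => { tag => 'div', attr => { id    => 'newstitle' } },
        'example.org' => { tag => 'div', attr => { class => 'titlenews' } },
    );

    my $url = shift @ARGV or die "usage: $0 URL\n";
    my ($host) = $url =~ m{^https?://(?:www\.)?([^/]+)};

    # Pretend to be a phone so the site serves its lighter, mostly-HTML version.
    my $ua = LWP::UserAgent->new(
        agent => 'Mozilla/5.0 (iPhone; CPU iPhone OS 9_0 like Mac OS X) Mobile/13A344',
    );
    my $res = $ua->get($url);
    die "fetch failed: ", $res->status_line, "\n" unless $res->is_success;

    my $tree = HTML::TreeBuilder->new_from_content( $res->decoded_content );
    my $rule = $site_rules{$host};
    my $node = $rule ? $tree->look_down( _tag => $rule->{tag}, %{ $rule->{attr} } ) : undef;

    print $node ? $node->as_trimmed_text . "\n" : "no rule for $host - fall back to something cruder\n";
    $tree->delete;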

    • (Score: 2) by Username on Saturday June 04 2016, @06:12PM

      by Username (4557) on Saturday June 04 2016, @06:12PM (#355187)

      PS: Also, I'd compare the articles on the scrape list to other articles that were scraped at the same time and link them together. I'd probably take the first 2000 characters of the article, remove common words like "of|the|a|it" etc., and turn all runs of spaces into a single space (char 32). Then load the remaining words into an array using that space as the delimiter, find each word in the other articles' arrays, and use the number of common words to determine relevancy. Or maybe just find some keyword from a keyword list and tag the article with it.

      Something like that.
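
      A quick sketch of that word-overlap idea; the stopword list is made up and nothing is tuned:

      use strict;
      use warnings;

      my %stop = map { $_ => 1 } qw(of the a an it is to in and for on that);

      # Reduce an article to a set of uncommon lowercase words from its first 2000 chars.
      sub word_set {
          my ($text) = @_;
          my %seen;
          for my $w ( split /\s+/, lc substr( $text, 0, 2000 ) ) {
              $w =~ s/[^a-z0-9]//g;
              $seen{$w} = 1 if length $w and not $stop{$w};
          }
          return \%seen;
      }

      # Relevancy = number of uncommon words two articles share.
      sub relevancy {
          my ( $set1, $set2 ) = map { word_set($_) } @_;
          return scalar grep { $set2->{$_} } keys %$set1;
      }

      printf "common words: %d\n", relevancy( "The cat sat on the mat", "A cat on a red mat" );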

  • (Score: 2) by mcgrew on Saturday June 04 2016, @06:07PM

    by mcgrew (701) <publish@mcgrewbooks.com> on Saturday June 04 2016, @06:07PM (#355182) Homepage Journal

    I haven't had the need for a scraper, but coincidentally I ran across this site [htmlgoodies.com] yesterday when I was looking for a solution to a different problem, and it had Fetch Hyperlinked Files using Jsoup [htmlgoodies.com]. I never heard of Jsoup before and have no idea if it would work for you. Good Luck!

    --
    mcgrewbooks.com mcgrew.info nooze.org
  • (Score: 5, Insightful) by Anonymous Coward on Saturday June 04 2016, @06:39PM

    by Anonymous Coward on Saturday June 04 2016, @06:39PM (#355195)

    Don't do it!

    1) If the article isn't good enough for an editor to vet and read, why is it good enough to waste every soylentil's time on?

    2) Scraping JS means soylentils would need to enable JS. Have you /spoken/ to us? When was the last time one of us /wasn't/ running noscript-or-equivalent??

    • (Score: 2) by urza9814 on Tuesday June 07 2016, @11:59PM

      by urza9814 (3954) on Tuesday June 07 2016, @11:59PM (#356662) Journal

      1) If the article isn't good enough for an editor to vet and read, why is it good enough to waste every soylentil's time on?

      Seems that they don't need editors, they need article submissions. That's what this is intended to solve. The articles will still be edited as usual, this will just ensure that the editors have more than just a bare link to start with.

      2) Scraping JS means soylentils would need to enable JS. Have you /spoken/ to us? When was the last time one of us /wasn't/ running noscript-or-equivalent??

      The idea is for the SN servers to run the JavaScript so you don't have to.

  • (Score: 2) by goodie on Saturday June 04 2016, @07:03PM

    by goodie (1877) on Saturday June 04 2016, @07:03PM (#355212) Journal

    Not sure whether that would work for you, but Beautiful Soup (Python) was made for this type of thing, I reckon. I've used it to clean up messy HTML etc. and to retrieve only the text contents from within a webpage, and it was easy to use and very good at it (for my purpose).

    https://www.crummy.com/software/BeautifulSoup/ [crummy.com]

    • (Score: 0) by Anonymous Coward on Saturday June 04 2016, @07:11PM

      by Anonymous Coward on Saturday June 04 2016, @07:11PM (#355214)

      BS works until you hit heavy JS or Flash sites; then, not so much.

      Source: I write scrapers from time to time with BS/Python for my conky info windows.

  • (Score: 2) by AudioGuy on Saturday June 04 2016, @07:04PM

    by AudioGuy (24) on Saturday June 04 2016, @07:04PM (#355213) Journal

    yourshell # lynx -dump http://soylentnews.org/ [soylentnews.org]

    (dumps to your screen plain text and a list of links at bottom)

    It's reasonably nicely formatted for parsing.

    Your grep skills will be improved. :-)

  • (Score: 2) by jdavidb on Saturday June 04 2016, @07:39PM

    by jdavidb (5690) on Saturday June 04 2016, @07:39PM (#355225) Homepage Journal
    The correct way to scrape in Perl is the awesome WWW::Mechanize by the fantastic Andy Lester [github.com], author of ack [beyondgrep.com].
    --
    ⓋⒶ☮✝🕊 Secession is the right of all sentient beings
  • (Score: 2, Informative) by DonkeyChan on Saturday June 04 2016, @08:18PM

    by DonkeyChan (5551) on Saturday June 04 2016, @08:18PM (#355246)

    http://www.seleniumhq.org/ [seleniumhq.org]
    I use Selenium for all my unit tests and it's been far easier to implement than WWW::Mechanize or other solutions.

  • (Score: 0) by Anonymous Coward on Saturday June 04 2016, @08:22PM

    by Anonymous Coward on Saturday June 04 2016, @08:22PM (#355248)

    firefox plugin: submit for consideration ... abuse ... ooops

  • (Score: 3, Informative) by number6 on Saturday June 04 2016, @09:43PM

    by number6 (1831) on Saturday June 04 2016, @09:43PM (#355282) Journal

    Web::Scraper
    http://search.cpan.org/dist/Web-Scraper/ [cpan.org]
    Perl module, Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions, by Tatsuhiko Miyagawa.
    * A blog article............http://perlgerl.wordpress.com/2011/02/27/219/
    * A comment...............Web::Scraper is so effin awesome. You just yank up some LWP from disk,
        add some scraper to it and whambo! you have yourself some local data that the web universe
        used to call its own. Follow that author on CPAN; he's a genius.

     
    How to fetch urls? I want to build a Perl web scraper like Python's urllib
    https://www.reddit.com/r/perl/comments/3oiwqn/how_to_fetch_urls_like_in_pythons_urllib/ [reddit.com]

     
    Web Scraping with Perl & PhantomJS (headless WebKit browser)
    article, Feb 2013, by Rob Hammond (Perl programmer)
    http://blogs.perl.org/users/robhammond/2013/02/web-scraping-with-perl-phantomjs.html [perl.org]

     
    Perl Proxy Scraper
    https://phl4nk.wordpress.com/2015/04/11/perl-proxy-scraper/ [wordpress.com]
    * A comment...............Neat! I toyed around with a one-liner that does about the same thing for standard output:
        perl -Mojo -E 'say g($_)->body =~ /((?:\d{1,3}\.?){4}:\d{1,4})/g for @ARGV' $url1 $url2 ...

     
    how-to-develop-a-good-scraper-on-perl - Google Search [google.com]

     

  • (Score: 0) by Anonymous Coward on Saturday June 04 2016, @10:29PM

    by Anonymous Coward on Saturday June 04 2016, @10:29PM (#355298)

    BlackWidow
    by http://SoftByteLabs.com/ [softbytelabs.com]

    Published June 11, 2004

    This is a multi-function Internet tool. It is an offline browser, Web site scanner, a site mapping tool, a site ripper, a site mirroring tool and an FTP scanner. Use it to scan a site and create a complete profile of the site's structure, files, external links and link errors. Then use it to download the web site with its structure and files intact, to use as a site mirror or to be converted by BlackWidow into a locally linked site for off-line browsing and long-term reference. Or use it to scan for and download any selection of files in part of a site or in a group of sites.
    BlackWidow will scan HTTP sites, SSL sites and FTP sites. You can access password-protected sites, use threads, pull links from JavaScript and JavaScript files, and resume broken downloads. View, edit and print the structure of a Web site, write your own plug-ins and automatically load your plug-ins. It will also scan Adobe Acrobat files for links.

    https://archive.org/details/tucows_193802_BlackWidow [archive.org]

  • (Score: 4, Informative) by isostatic on Saturday June 04 2016, @10:37PM

    by isostatic (365) on Saturday June 04 2016, @10:37PM (#355301) Journal

    Good evening. Today is Good Friday. There is no news

    If there's nothing to report, don't report it. Don't go for an 18-minute news program consisting of 15 minutes of "old sportsman has died" and a couple of minutes of "politicians lie", as happened with the BBC this evening.

  • (Score: 0) by Anonymous Coward on Sunday June 05 2016, @12:33AM

    by Anonymous Coward on Sunday June 05 2016, @12:33AM (#355331)

    Does the site have an RSS feed? That's a whole lot easier to parse.

  • (Score: 3, Informative) by Hairyfeet on Sunday June 05 2016, @02:55AM

    by Hairyfeet (75) <bassbeast1968NO@SPAMgmail.com> on Sunday June 05 2016, @02:55AM (#355378) Journal

    But there is an easy way to fix the "queue runs dry" issue, which is where I've gotten every article I've posted here: Daily Rotation [dailyrotation.com], which is nothing but science and tech, VERY nerdy and right up this site's alley. As just a couple of examples, from just grabbing the latest headlines: a missing features installer for XP [betanews.com] from the guy that made the MFI for Win 10, FBI says TOR child porn exploit wasn't malware [arstechnica.com], and Japan's battleship island [cnet.com] with some great pictures.

    So if the site starts to run dry? Just head over to Rotation and grab the headlines; it's all the nerdy kind of stuff the old site USED to cover.

    --
    ACs are never seen so don't bother. Always ready to show SJWs for the racists they are.
  • (Score: 2) by Fnord666 on Monday June 06 2016, @01:26AM

    by Fnord666 (652) on Monday June 06 2016, @01:26AM (#355658) Homepage
    Have you considered contacting janrinok about the StoryBot software that he uses to submit stories?
  • (Score: 0) by Anonymous Coward on Monday June 06 2016, @03:36PM

    by Anonymous Coward on Monday June 06 2016, @03:36PM (#355937)

    ...use a headless browser like Phantom JS (http://phantomjs.org/ [phantomjs.org]) to pull the desired content.

  • (Score: 0) by Anonymous Coward on Monday June 06 2016, @10:20PM

    by Anonymous Coward on Monday June 06 2016, @10:20PM (#356151)

    Pages that require JavaScript are a pain, and I can't answer that problem.

    But if you can reach the point of getting the page content into an HTML file, I can say that I had good success using the following to scrape content: I'd use tidy (https://www.google.com/#q=tidy+xhtml) to convert the HTML into XML-parsable XHTML. Then I'd use XSLT to mine the content I wanted out of the XHTML. The advantage to this was that the XSLT was fully programmable without having to recompile any code, so when sites had minor page layout changes, the tidy+XSLT usually continued working without noticing.
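
    For anyone who wants to try that route from Perl, a small sketch; it assumes the tidy binary plus the XML::LibXML and XML::LibXSLT modules, and 'extract.xsl' stands in for whatever site-specific stylesheet you write:

    use strict;
    use warnings;
    use XML::LibXML;
    use XML::LibXSLT;

    my $html = shift @ARGV or die "usage: $0 page.html\n";

    # Let tidy turn tag soup into well-formed XHTML that an XML parser will accept.
    system("tidy -q -asxhtml --numeric-entities yes -o /tmp/page.xhtml $html");

    my $doc   = XML::LibXML->load_xml( location => '/tmp/page.xhtml' );
    my $style = XML::LibXSLT->new->parse_stylesheet(
        XML::LibXML->load_xml( location => 'extract.xsl' )   # your site-specific mining rules
    );

    # The stylesheet does the mining; a layout tweak means editing XSL, not recompiling code.
    print $style->output_as_bytes( $style->transform($doc) );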