
SoylentNews is people

posted by martyb on Wednesday July 31 2019, @06:36AM
from the Arachnophobia dept.

"Come on, I worked so hard on this project! And this is publicly accessible data! There's certainly a way around this, right? Or else, I did all of this for nothing... Sigh..."

Yep - this is what I said to myself, just after realizing that my ambitious data analysis project could get me into hot water. I intended to deploy a large-scale web crawler to collect data from multiple high profile websites. And then I was planning to publish the results of my analysis for the benefit of everybody. Pretty noble, right? Yes, but also pretty risky.

Interestingly, I've been seeing more and more projects like mine lately. And even more tutorials encouraging some form of web scraping or crawling. But what troubles me is the appalling widespread ignorance on the legal aspect of it.

So this is what this post is all about - understanding the possible consequences of web scraping and crawling. Hopefully, this will help you to avoid any potential problem.

Disclaimer: I'm not a lawyer. I'm simply a programmer who happens to be interested in this topic. You should seek out appropriate professional advice regarding your specific situation.

https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/



This discussion has been archived. No new comments can be posted.
  • (Score: 2, Informative) by Anonymous Coward on Wednesday July 31 2019, @06:47AM (16 children)

    by Anonymous Coward on Wednesday July 31 2019, @06:47AM (#873450)

    Nothing is legal, that's why you need high powered $200/hr lawyers to do anything. See you in court soon, probably.

    • (Score: 5, Insightful) by The Mighty Buzzard on Wednesday July 31 2019, @07:13AM (13 children)

      Which is more or less what the author's point was. The headline was pure clickbait bullshit though. You can scrape anything you like that is intentionally made publicly accessible. You are not legally bound by anyone's ToS; they are not an EULA or any other sort of contract. Now if you're redistributing something that even can hold copyright you could get in trouble for that but that's another discussion entirely.

      Consider that if you ever get sued, you can't simply dismiss it. You need to defend yourself, and prove that you did nothing wrong. This has nothing to do with whether or not it's fair, or whether or not what you did is really illegal.

      --
      My rights don't end where your fear begins.
      • (Score: 4, Informative) by FatPhil on Wednesday July 31 2019, @07:23AM (8 children)

        by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Wednesday July 31 2019, @07:23AM (#873461) Homepage
        Yup, I also took umbrage at the absolute "You're legally bound by those terms" in point 3.

        I'm currently scraping a website I've used for over a decade in order to provide super-nerdy stats that lots of regular users of the site love. I'm definitely on the iffy side of the law here, but I do know that I'm adding value to the site so they shouldn't have a problem with what I do, and I also know that they're not GDPR compliant, so if they pull the lawyers in, they're on shaky ground. So I shall continue...
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
        • (Score: 0) by Anonymous Coward on Wednesday July 31 2019, @09:12AM (3 children)

          by Anonymous Coward on Wednesday July 31 2019, @09:12AM (#873474)

          There is legally nothing they can bind you to to stop you scraping or crawling a public website, same as you can do with a printed newspaper.
          Except cutting pieces to compose a ransom letter...

          • (Score: 0) by Anonymous Coward on Wednesday July 31 2019, @03:07PM (1 child)

            by Anonymous Coward on Wednesday July 31 2019, @03:07PM (#873572)

            If you took that printed newspaper and ran every page through a (very large!) copier and then started handing out copies of it on the street corner you might find out very quickly that there's something that can be done about that.

            • (Score: 3, Interesting) by Freeman on Wednesday July 31 2019, @05:43PM

              by Freeman (732) Subscriber Badge on Wednesday July 31 2019, @05:43PM (#873631) Journal

              True. Also true: I can copy all of the freely available stuff, use it to come up with statistics or whatnot on my own, and then legally publish those statistics.

              --
              Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
          • (Score: 2) by FatPhil on Thursday August 01 2019, @08:42AM

            by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Thursday August 01 2019, @08:42AM (#873919) Homepage
            Hmmm, you seem about 300 years behind the times when it comes to IP laws...
            --
            Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
        • (Score: 0) by Anonymous Coward on Wednesday July 31 2019, @12:13PM (3 children)

          by Anonymous Coward on Wednesday July 31 2019, @12:13PM (#873499)

          I'm currently scraping a website I've used for over a decade in order to provide super-nerdy stats that lots of regular users of the site love. I'm definitely on the iffy side of the law here

          Why? Scraping a website is exactly what Google is doing every time their bot visits. Being illegal or not has little bearing on whether someone will sue you or not.

          and I also know that they're not GDPR compliant, so if they pull the lawyers in, they're on shaky ground. So I shall continue...

          And then you enter the world of the illegal. Blackmail and extortion are illegal. GDPR compliance is something completely different: the purpose of GDPR is to allow users to control their own information on a service. But as you can see from the exceptions in the law, it's not exactly a perfect idea in all cases.

          • (Score: 2) by JoeMerchant on Wednesday July 31 2019, @12:28PM

            by JoeMerchant (3937) on Wednesday July 31 2019, @12:28PM (#873500)

            Scraping a website is exactly what Google is doing every time their bot visits

            Some sites want to be scraped, some don't.

            Some women want men knocking on their door in the middle of the night, some don't - in some places this is signified by a red light in the window.

            The question is: is the visitor providing welcome value, or not. When you play in the big leagues, you run the risk of legal harassment.

            --
            Україна досі не є частиною Росії Слава Україні🌻 https://news.stanford.edu/2023/02/17/will-russia-ukraine-war-end
          • (Score: 2) by Freeman on Wednesday July 31 2019, @05:50PM (1 child)

            by Freeman (732) Subscriber Badge on Wednesday July 31 2019, @05:50PM (#873635) Journal

            Blackmail may also be considered a form of extortion.[1] Although the two are generally synonymous, extortion is the taking of personal property by threat of future harm.[6] Blackmail is the use of threat to prevent another from engaging in a lawful occupation and writing libelous letters or letters that provoke a breach of the peace, as well as use of intimidation for purposes of collecting an unpaid debt.[7]

            https://en.m.wikipedia.org/wiki/Blackmail [wikipedia.org]

            That's not what they were describing. They aren't trying to do anything harmful or injurious to the party and they aren't threatening them. They just outlined what they might do in defense, if the other party decided to sue them, for no good reason.

            --
            Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
            • (Score: 2) by Osamabobama on Wednesday July 31 2019, @07:37PM

              by Osamabobama (5842) on Wednesday July 31 2019, @07:37PM (#873693)

              That's not what they were describing. They aren't trying to do anything harmful or injurious to the party and they aren't threatening them. They just outlined what they might do in defense, if the other party decided to sue them, for no good reason.

              There are too many pronouns to follow. I will replace them as I understand this situation, as it relates to blackmail:

              That's not what (they - the blogger) were describing. (They - the website scrapers) aren't trying to do anything harmful or injurious to the party and (they - the website scrapers) aren't threatening (them - the scraped website operators). (They - the website scrapers) just outlined what (they - the website scrapers) might do in defense, if the (other party - the scraped website operators) decided to sue them, for no good reason.

              --
              Appended to the end of comments you post. Max: 120 chars.
      • (Score: 1, Insightful) by Anonymous Coward on Wednesday July 31 2019, @05:06PM (2 children)

        by Anonymous Coward on Wednesday July 31 2019, @05:06PM (#873613)

        You can scrape anything you like that is intentionally made publicly accessible. You are not legally bound by anyone's ToS; they are not an EULA or any other sort of contract.

        So I can start photocopying copies of "Harry Potter and the Sorcerer's Stone" and sell them for profit? Sweet...

        I am not a lawyer.

        Clearly, you are not, either.

        Even if something is made publicly available, the author still owns the rights to copy it (copyright), and has legal privileges. However, as they have made it available to the public, they have forsaken many legal privileges as well. As which privileges are still held and which are forsaken... well, that's a question for a lawyer.

        • (Score: 3, Informative) by The Mighty Buzzard on Wednesday July 31 2019, @10:40PM

          Learn to read and learn copyright law. For starters we're not talking about anything you have to pay to see, we're talking shit put out for anyone to view at no cost. Also, you can sit there and make as many copies of Harry Potter as you like every day for a year and nothing will happen to you. Distributing them is what will get you in trouble. Running word count analysis on them will not.

          --
          My rights don't end where your fear begins.
        • (Score: 3, Insightful) by jbruchon on Thursday August 01 2019, @03:14AM

          by jbruchon (4473) on Thursday August 01 2019, @03:14AM (#873849) Homepage

          It's amazing how you got "make a copy (which violates the owner's copyright)" out of "send an HTTP request to the owner for a copy and receive a copy from the owner (you know, the person who's explicitly allowed to make copies and send them to people)." Not a lawyer? If you are, I'd predict that you won't be for long. Next thing you know, you'll be arguing that ad blocking is theft because somehow it's illegal to NOT go get other data after you have already been given a copy of the data by the owner of the data.

          --
          I'm just here to listen to the latest song about butts.
      • (Score: 2) by edIII on Wednesday July 31 2019, @07:47PM

        by edIII (791) on Wednesday July 31 2019, @07:47PM (#873699)

        That other discussion is worth having, though. I've been involved in things like this, always involving publicly owned data. This reminds me of certain cities trying to sue people, claiming they own the train schedules and are the only ones who can publish the data, which is complete and utter anti-American fascist bullshit.

        The reason they want to maintain control is that pre-Internet they charged to print the data, and later tried charging a couple hundred dollars for an FTP account to access it, even though the website makes all the information easily and cheaply available. The web scrapers aren't even taking that many resources.

        In this specific situation, I recommend fighting the fucking bastards in the government tooth and nail. They don't deserve absolute control of what is publicly owned data.

        --
        Technically, lunchtime is at any moment. It's just a wave function.
    • (Score: 1, Touché) by Anonymous Coward on Wednesday July 31 2019, @12:01PM

      by Anonymous Coward on Wednesday July 31 2019, @12:01PM (#873495)

      $200/hr is y2k price.

    • (Score: 1, Interesting) by Anonymous Coward on Wednesday July 31 2019, @01:15PM

      by Anonymous Coward on Wednesday July 31 2019, @01:15PM (#873517)

      Nothing is legal

      I regularly get people asking me if such-and-such various random thing is legal or illegal. Of course I am not a lawyer. All I can tell people is to assume everything is illegal. Even when something is legal, those with lots of money can still try to legally harass you, draining your wallet. Lots of things are illegal that should not be, and the normal person has no way to know for themselves every little thing that is illegal/legal or what exceptions there are.

      The legal system is a bad joke.

  • (Score: 2) by upstart on Wednesday July 31 2019, @06:51AM

    by upstart (6666) Subscriber Badge on Wednesday July 31 2019, @06:51AM (#873452) Journal

    They're on to me!

  • (Score: 0) by Anonymous Coward on Wednesday July 31 2019, @01:28PM

    by Anonymous Coward on Wednesday July 31 2019, @01:28PM (#873523)

    it's always a robot that fetches a webpage. a web browser is not a human but a "dumb robot".
    that said, what is worse than a "dumb robot" is one that thinks it's smart and gets stuck in an infinite loop following links because they are dynamically generated, so the same 3 pages get loaded over and over but the hyperlink identifying them has changed ...
    scraping: i don't like it, but i have to live with it because i publish the data publicly.
    i cannot do anything about a person standing in front of a "have you seen this missing cat?" poster for hours on end trying to decipher it and keeping other people from seeing it either.
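
    A minimal sketch, assuming Python, of one way a crawler could avoid that kind of loop: fingerprint each page with the query strings stripped out, and skip any page whose fingerprint has already been seen. The names crawl, fingerprint, fetch and extract_links are hypothetical; fetch and extract_links are supplied by the caller. The max_pages cap is only a backstop for cases the fingerprint misses.

```python
import hashlib
import re
from collections import deque


def fingerprint(body):
    """Hash a page body after dropping query strings from it.

    Assumption: the only thing that changes between visits is a token in
    the query string (e.g. "?sid=123"), so stripping it makes repeated
    pages hash to the same value.
    """
    normalized = re.sub(r"\?[^\s\"'<>]*", "", body)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def crawl(start_url, fetch, extract_links, max_pages=100):
    """Breadth-first crawl that does not re-follow pages already seen.

    fetch(url) returns the page body as a string and extract_links(body)
    returns a list of URLs; both are provided by the caller.
    """
    seen_urls = {start_url}
    seen_pages = set()
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        body = fetch(url)
        digest = fingerprint(body)
        if digest in seen_pages:
            # Same page under a new URL: do not follow its links again.
            continue
        seen_pages.add(digest)
        visited.append(url)
        for link in extract_links(body):
            if link not in seen_urls:
                seen_urls.add(link)
                queue.append(link)
    return visited
```

    On a site whose three pages link to each other with a fresh token every time, this stops after each page has been recorded once instead of chasing new URLs forever. It would not help if the page text itself changes on every load; that is what the page cap is for.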

  • (Score: 4, Interesting) by bzipitidoo on Wednesday July 31 2019, @02:12PM (1 child)

    by bzipitidoo (4388) Subscriber Badge on Wednesday July 31 2019, @02:12PM (#873544) Journal

    Anyone can sue for any reason at all. The lawsuit might be thrown out of court as groundless, if it gets that far.

    But court is not where most accusers want to go. They want the accused to chicken out and settle out of court. They want the accused to pay the accuser lots of money, or agree not to compete, or give up whatever else the accuser does not like, without ever realizing that the accuser was full of it, actually had no legal grounds for the accusation, and would have lost in court. That's what a lot of the stuff in EULAs and ToS is really all about. They're trying to scare people away from exercising their rights.

    In the case of the MAFIAA vs pirates, their threats are not entirely groundless, because the law is antiquated, and they lobby to keep it that way. But they've been learning that piracy is a lot, lot bigger than they are, and gradually realizing that they've taken on nothing less than the laws of nature, which is ultimately unwinnable. Unfortunately, until the law catches up with reality, they can cause a lot of misery, and accuse anyone of piracy. Those sick bastards have been known to pick on the weakest possible defendants, regardless of the credibility of a piracy accusation.

    Of course costing the accused a great deal of money and time, defending themselves from baseless accusations, is the point of legal harassment, particularly of the variety that does go to court.

    There are also the outright criminal frauds. Like the fake message allegedly from the IRS that you owe lots of taxes and will be in Big Trouble if you don't pay Right Now.

    I'd guess it works entirely too often and too well. A lot of people are afraid of a fight, and for good reason.

    It can get to the point where, yes, it seems that nothing is legal. Yet somehow, it is legal for these writers of legal bull to leave out exceptions, and just plain lie. For instance, the NFL routinely claims you can't make any use whatsoever of footage of an NFL football game without prior written permission from them. They're wrong, and they know it. They lie. Why they're allowed to overstate their rights like that, and no one smacks them down for it, I don't know. Of course, one reason is that they're powerful.

    Another example is lawn care. Somehow, notices of violations that your Grass is Too High (over 12 inches in some places, and only 6 or even 4 inches in other places) leave out exceptions, such as for vegetable gardens. I continue to be astonished at the love for these Home Owners Associations and city ordinances, and their incredibly petty and fascist rules. Why do people not only put up with it, but go for it?

    Lately we've been seeing a lot of news stories about racists calling the police on brown individuals who weren't doing anything wrong. Permit Patty, for instance.

    In sum, don't let the likes of Permit Patty fool you.

    • (Score: 0) by Anonymous Coward on Thursday August 01 2019, @12:56AM

      by Anonymous Coward on Thursday August 01 2019, @12:56AM (#873805)

      All they need is for the cost of getting them laughed out of court to be more than you can afford. If your choices are settle or bankruptcy, what do you do?

  • (Score: 2) by looorg on Wednesday July 31 2019, @02:21PM

    by looorg (578) on Wednesday July 31 2019, @02:21PM (#873547)

    I'm not some fancy lawyer either but I'm fairly certain they'll be shit out of luck upholding that somehow. If I can browse it I can scrape it. Even if using the program, script, bot or whatnot was somehow illegal, how is ctrl+a ctrl+c ctrl+v going to be deemed illegal? Seriously, if you don't want people to use your information then don't put it out there. All it does is save me time, grabbing only what I really need and saving it in a format that I like. It's like selective browsing if anything.

    I just glanced at the blog post and it seems to sum up as one giant perhaps, perhaps, perhaps, and that it's some kind of grey area. I gather that LinkedIn isn't happy that all the data from their drone followers who want to "network" is falling into the competition's hands, but there's only so much they can do about that. Asking for data and then saying it's all secret and shit? That must be some kind of a joke. Still, they might be in a better spot than others, since I seem to recall them requiring some kind of account and login to take part in certain actions -- that is really as much as I know about LinkedIn; it seemed like a stupid idea overall and I didn't fancy giving them any free information, just as I don't think it's a great idea to feed Facebook et al. any data either.

    That said, if you have data, someone will scrape it, so you might as well make it easier on yourself and provide an API for accessing said data. That publicly available data is getting scraped one way or another, so why make it harder than you have to? I care about their robots.txt and scraping speeds etc. as much as search engines etc. do (i.e. not at all).
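
    For anyone who does care, a minimal sketch in Python of checking robots.txt rules and the requested crawl delay with the standard library's urllib.robotparser. The robots.txt text, the agent name "examplebot" and the example.com URLs are made up for illustration; a real crawler would download the file from the site it is visiting.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_policy(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, agent, url):
    """Return True if the rules let this agent fetch the URL."""
    return parser.can_fetch(agent, url)


def delay_seconds(parser, agent, default=1.0):
    """Return the Crawl-delay for this agent, or a default if none is set."""
    delay = parser.crawl_delay(agent)
    return float(delay) if delay is not None else default


policy = build_policy(ROBOTS_TXT)
print(allowed(policy, "examplebot", "https://example.com/stats"))
print(allowed(policy, "examplebot", "https://example.com/private/x"))
print(delay_seconds(policy, "examplebot"))
```

    A polite scraper would sleep for delay_seconds(...) between requests and skip any URL for which allowed(...) is false. None of this is legally binding; it is just the convention the site asked for.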

  • (Score: 1, Insightful) by Anonymous Coward on Wednesday July 31 2019, @03:01PM

    by Anonymous Coward on Wednesday July 31 2019, @03:01PM (#873569)

    Copyright gives one the right to control something called 'copying'.
    A long time ago, with paper publishing, this was easy to define.

    With computers, less so.
    For example if you buy an e-book, does your computer make another copy when it puts it on the screen for reading?
    Reason does not seem to rule.

    If it did, then perhaps for the web, publishing (i.e. serving the web page) should be copying and the actions at the page receiving end should not be.
    If one saves a copy at the receiving end, that requires permission for making a copy.
    If one saves a fair-use excerpt or some non-copyrighted data, then no permission necessary.
    Using a url that is obvious from permitted browsing should be fair with automatic means, as long as it does not overload the server.
    Not sure how constructing a strange url to convince a site to cough up something private should be protected, but copyright seems wrong.

    So, given that Mickey Mouse seems to be the one picking the rules, how does one get reasonableness back into the game?

  • (Score: 1, Insightful) by Anonymous Coward on Wednesday July 31 2019, @03:08PM (3 children)

    by Anonymous Coward on Wednesday July 31 2019, @03:08PM (#873573)

    Web sites seem to put very tight limits on what one can do with them, but at the same time, they do wild west things with my browser.

    They want browsers to play nice with their end, but anything goes if they can get away with it at my end.

    How come the laws protecting against unauthorized computer access don't apply at my end?

    • (Score: 3, Insightful) by Dr Spin on Wednesday July 31 2019, @06:04PM (2 children)

      by Dr Spin (5239) on Wednesday July 31 2019, @06:04PM (#873647)

      How come the laws protecting against unauthorized computer access don't apply at my end?

      How many congress-critters do you own? Zero?

      - could be the answer you are looking for is right in front of your nose!

      --
      Warning: Opening your mouth may invalidate your brain!
      • (Score: 0) by Anonymous Coward on Wednesday July 31 2019, @06:53PM (1 child)

        by Anonymous Coward on Wednesday July 31 2019, @06:53PM (#873680)

        No, I think it's because their TOS says limited at their end and wild west at the user's end.

        That TOS is the contract that governs the relationship.

        The thing is, does that lopsided of a contract indicate that no reasonable person would agree to it if they had a choice?
        Is that relevant to the validity of the contract?

        • (Score: 0) by Anonymous Coward on Thursday August 01 2019, @01:02AM

          by Anonymous Coward on Thursday August 01 2019, @01:02AM (#873806)

          In a contract of adhesion, clauses that are unconscionable or outright illegal aren't supposed to be enforceable, but again, how many politicians and judges do you own? How much can you afford to spend on legal fees?
