
SoylentNews is people

posted by martyb on Wednesday July 31 2019, @06:36AM
from the Arachnophobia dept.

"Come on, I worked so hard on this project! And this is publicly accessible data! There's certainly a way around this, right? Or else, I did all of this for nothing... Sigh..."

Yep - this is what I said to myself, just after realizing that my ambitious data analysis project could get me into hot water. I intended to deploy a large-scale web crawler to collect data from multiple high profile websites. And then I was planning to publish the results of my analysis for the benefit of everybody. Pretty noble, right? Yes, but also pretty risky.

Interestingly, I've been seeing more and more projects like mine lately. And even more tutorials encouraging some form of web scraping or crawling. But what troubles me is the appalling widespread ignorance on the legal aspect of it.

So this is what this post is all about - understanding the possible consequences of web scraping and crawling. Hopefully, this will help you to avoid any potential problem.

Disclaimer: I'm not a lawyer. I'm simply a programmer who happens to be interested in this topic. You should seek out appropriate professional advice regarding your specific situation.

https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/


  • (Score: 5, Insightful) by The Mighty Buzzard on Wednesday July 31 2019, @07:13AM (13 children)

    by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Wednesday July 31 2019, @07:13AM (#873458) Homepage Journal

    Which is more or less what the author's point was. The headline was pure clickbait bullshit though. You can scrape anything you like that is intentionally made publicly accessible. You are not legally bound by anyone's ToS; they are not an EULA or any other sort of contract. Now, if you're redistributing something that can even hold copyright, you could get in trouble for that, but that's another discussion entirely.

    Consider that if you ever get sued, you can't simply dismiss it. You need to defend yourself, and prove that you did nothing wrong. This has nothing to do with whether or not it's fair, or whether or not what you did is really illegal.

    --
    My rights don't end where your fear begins.
  • (Score: 4, Informative) by FatPhil on Wednesday July 31 2019, @07:23AM (8 children)

    by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Wednesday July 31 2019, @07:23AM (#873461) Homepage
    Yup, I also took umbrage at the absolute "You're legally bound by those terms" in point 3.

    I'm currently scraping a website I've used for over a decade in order to provide super-nerdy stats that lots of regular users of the site love. I'm definitely on the iffy side of the law here, but I do know that I'm adding value to the site so they shouldn't have a problem with what I do, and I also know that they're not GDPR compliant, so if they pull the lawyers in, they're on shaky ground. So I shall continue...
    --
    Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
    • (Score: 0) by Anonymous Coward on Wednesday July 31 2019, @09:12AM (3 children)

      by Anonymous Coward on Wednesday July 31 2019, @09:12AM (#873474)

      There is legally nothing they can bind you to to stop you scraping or crawling a public website, same as you can do with a printed newspaper.
      Except cutting pieces to compose a ransom letter...

      • (Score: 0) by Anonymous Coward on Wednesday July 31 2019, @03:07PM (1 child)

        by Anonymous Coward on Wednesday July 31 2019, @03:07PM (#873572)

        If you took that printed newspaper and ran every page through a (very large!) copier and then started handing out copies of it on the street corner you might find out very quickly that there's something that can be done about that.

        • (Score: 3, Interesting) by Freeman on Wednesday July 31 2019, @05:43PM

          by Freeman (732) on Wednesday July 31 2019, @05:43PM (#873631) Journal

          True. Also true: I can copy all of the freely available stuff, then use that stuff to come up with statistics or whatnot on my own, then legally publish those statistics.

          --
          Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
      • (Score: 2) by FatPhil on Thursday August 01 2019, @08:42AM

        by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Thursday August 01 2019, @08:42AM (#873919) Homepage
        Hmmm, you seem about 300 years behind the times when it comes to IP laws...
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
    • (Score: 0) by Anonymous Coward on Wednesday July 31 2019, @12:13PM (3 children)

      by Anonymous Coward on Wednesday July 31 2019, @12:13PM (#873499)

      I'm currently scraping a website I've used for over a decade in order to provide super-nerdy stats that lots of regular users of the site love. I'm definitely on the iffy side of the law here

      Why? Scraping a website is exactly what Google is doing every time their bot visits. Being illegal or not has little bearing on whether someone will sue you or not.

      and I also know that they're not GDPR compliant, so if they pull the lawyers in, they're on shaky ground. So I shall continue...

      And then you enter into the world of illegal. Blackmail and extortion are illegal. GDPR compliance is something completely different. The purpose of GDPR is to allow users to control their own information on a service. But as you can see from the exceptions written into the law, it's not exactly a perfect idea in all cases.

      • (Score: 2) by JoeMerchant on Wednesday July 31 2019, @12:28PM

        by JoeMerchant (3937) on Wednesday July 31 2019, @12:28PM (#873500)

        Scraping a website is exactly what Google is doing every time their bot visits

        Some sites want to be scraped, some don't.

        Some women want men knocking on their door in the middle of the night, some don't - in some places this is signified by a red light in the window.

        The question is: is the visitor providing welcome value, or not. When you play in the big leagues, you run the risk of legal harassment.

        --
        🌻🌻 [google.com]
      • (Score: 2) by Freeman on Wednesday July 31 2019, @05:50PM (1 child)

        by Freeman (732) on Wednesday July 31 2019, @05:50PM (#873635) Journal

        Blackmail may also be considered a form of extortion.[1] Although the two are generally synonymous, extortion is the taking of personal property by threat of future harm.[6] Blackmail is the use of threat to prevent another from engaging in a lawful occupation and writing libelous letters or letters that provoke a breach of the peace, as well as use of intimidation for purposes of collecting an unpaid debt.[7]

        https://en.m.wikipedia.org/wiki/Blackmail [wikipedia.org]

        That's not what they were describing. They aren't trying to do anything harmful or injurious to the party and they aren't threatening them. They just outlined what they might do in defense, if the other party decided to sue them, for no good reason.

        --
        Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
        • (Score: 2) by Osamabobama on Wednesday July 31 2019, @07:37PM

          by Osamabobama (5842) on Wednesday July 31 2019, @07:37PM (#873693)

          That's not what they were describing. They aren't trying to do anything harmful or injurious to the party and they aren't threatening them. They just outlined what they might do in defense, if the other party decided to sue them, for no good reason.

          There are too many pronouns to follow. I will replace them as I understand this situation, as it relates to blackmail:

          That's not what (they - the blogger) were describing. (They - the website scrapers) aren't trying to do anything harmful or injurious to the party and (they - the website scrapers) aren't threatening (them - the scraped website operators). (They - the website scrapers) just outlined what (they - the website scrapers) might do in defense, if the (other party - the scraped website operators) decided to sue them, for no good reason.

          --
          Appended to the end of comments you post. Max: 120 chars.
  • (Score: 1, Insightful) by Anonymous Coward on Wednesday July 31 2019, @05:06PM (2 children)

    by Anonymous Coward on Wednesday July 31 2019, @05:06PM (#873613)

    You can scrape anything you like that is intentionally made publicly accessible. You are not legally bound by anyone's ToS; they are not an EULA or any other sort of contract.

    So I can start photocopying copies of "Harry Potter and the Sorcerer's Stone" and sell them for profit? Sweet...

    I am not a lawyer.

    Clearly, you are not, either.

    Even if something is made publicly available, the author still owns the rights to copy it (copyright), and has legal privileges. However, as they have made it available to the public, they have forsaken many legal privileges as well. As for which privileges are still held and which are forsaken... well, that's a question for a lawyer.

    • (Score: 3, Informative) by The Mighty Buzzard on Wednesday July 31 2019, @10:40PM

      by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Wednesday July 31 2019, @10:40PM (#873761) Homepage Journal

      Learn to read and learn copyright law. For starters we're not talking about anything you have to pay to see, we're talking shit put out for anyone to view at no cost. Also, you can sit there and make as many copies of Harry Potter as you like every day for a year and nothing will happen to you. Distributing them is what will get you in trouble. Running word count analysis on them will not.

      --
      My rights don't end where your fear begins.
    • (Score: 3, Insightful) by jbruchon on Thursday August 01 2019, @03:14AM

      by jbruchon (4473) on Thursday August 01 2019, @03:14AM (#873849) Homepage

      It's amazing how you got "make a copy (which violates the owner's copyright)" out of "send an HTTP request to the owner for a copy and receive a copy from the owner (you know, the person who's explicitly allowed to make copies and send them to people)." Not a lawyer? If you are, I'd predict that you won't be for long. Next thing you know, you'll be arguing that ad blocking is theft because somehow it's illegal to NOT go get other data after you have already been given a copy of the data by the owner of the data.

      --
      I'm just here to listen to the latest song about butts.
  • (Score: 2) by edIII on Wednesday July 31 2019, @07:47PM

    by edIII (791) on Wednesday July 31 2019, @07:47PM (#873699)

    That other discussion is worth having, though. I've been involved in things like this, always involving publicly owned data. This reminds me of certain cities trying to sue people, claiming they own the train schedules and are the only ones who can publish the data, which is complete and utter anti-American fascist bullshit.

    The reason they want to maintain control is that, pre-Internet, they charged to print the data, and later tried charging a couple hundred dollars for an FTP account to access it, even though the website makes all the information easily and cheaply available. The web scrapers aren't even taking that many resources.

    In this specific situation, I recommend fighting the fucking bastards in the government tooth and nail. They don't deserve absolute control of what is publicly owned data.

    --
    Technically, lunchtime is at any moment. It's just a wave function.