SoylentNews is people

posted by Fnord666 on Wednesday July 31 2024, @09:40PM
from the externalities dept.

A growing number of sites report losing bandwidth to AI crawlers. Read the Docs, a documentation hosting site, has published an analysis of the abuse it has seen from AI crawlers, with several examples.

We have been seeing a number of bad crawlers over the past few months, but here are a couple illustrative examples of the abuse we're seeing:

73 TB in May 2024 from one crawler

One crawler downloaded 73 TB of zipped HTML files in May 2024, with almost 10 TB in a single day. This cost us over $5,000 in bandwidth charges, and we had to block the crawler. We emailed this company, reporting a bug in their crawler, and we're working with them on reimbursing us for the costs.

[...] This was a bug in their crawler that was causing it to download the same files over and over again. There was no bandwidth limiting in place, or support for Etags and Last-Modified headers which would have allowed the crawler to only download files that had changed. We have reported this issue to them, and hopefully the issue will be fixed.
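For readers who have not met the ETag and Last-Modified mechanism mentioned above, here is a minimal sketch of a conditional GET using only the Python standard library. The function names are hypothetical, and nothing here reflects how the crawler in question was actually built; it only illustrates how a client can avoid downloading an unchanged file again.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build the validator headers for a conditional GET."""
    headers = {}
    if etag:
        # Ask the server to reply 304 if its current ETag still matches.
        headers["If-None-Match"] = etag
    if last_modified:
        # Ask the server to reply 304 if the file has not changed since then.
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None when nothing changed."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not Modified: keep the cached copy and the old validators.
            return None, etag, last_modified
        raise
```

A crawler that stores the validators from each visit and passes them back on the next one pays only for a small 304 response when a file is unchanged.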

Many of the bots even ignore the robots.txt file and its contents.
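As a rough illustration of what honoring robots.txt involves on the crawler side, the sketch below uses Python's standard urllib.robotparser. The robots.txt content, the bot names, and the docs.example.org URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is shut out entirely,
# everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent, url):
    """True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "SomeOtherBot"):
        for path in ("/en/latest/", "/private/report.html"):
            url = "https://docs.example.org" + path
            print(agent, path, may_fetch(agent, url))
```

A well-behaved crawler runs a check like this before every request. The file is only advisory, which is why a bot can ignore it.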



This discussion was created by Fnord666 (652) for logged-in users only, but now has been archived. No new comments can be posted.
  • (Score: 4, Funny) by Tork on Wednesday July 31 2024, @11:01PM

    by Tork (3914) Subscriber Badge on Wednesday July 31 2024, @11:01PM (#1366528)

    > As a major host of open source documentation, we'd love to work with these companies on a deal to crawl our site respectfully. We could build an integration that would alert them to content changes, and download the files that have changed. However, none of these companies have reached out to us, except in response to abuse reports.

    Holy shit! That's so unlike them!! 🙄

    --
    🏳️‍🌈 Proud Ally 🏳️‍🌈
  • (Score: 4, Touché) by c0lo on Wednesday July 31 2024, @11:42PM (3 children)

    by c0lo (156) Subscriber Badge on Wednesday July 31 2024, @11:42PM (#1366534) Journal

    > This was a bug in their crawler that was causing it to download the same files over and over again.

    Naturally stupid company builds corpus to train artificial intelligence. What can go wrong?

    --
    https://www.youtube.com/@ProfSteveKeen https://soylentnews.org/~MichaelDavidCrawford
    • (Score: 1, Funny) by Anonymous Coward on Thursday August 01 2024, @05:51AM (2 children)

      by Anonymous Coward on Thursday August 01 2024, @05:51AM (#1366567)
      Sites start giving abusive crawlers subtly fake information? Fake enough to cause problems, but subtle enough to be hard to detect?
      • (Score: 4, Touché) by Thexalon on Thursday August 01 2024, @10:50AM

        by Thexalon (636) on Thursday August 01 2024, @10:50AM (#1366588)

        Why bother with subtlety, when odds are pretty good that they'll gladly slurp up complete nonsense? All you need to do is have total nonsense appear in enough places, and a lot of machines (and people) start taking it seriously.

        If you don't think that works, watch the process of fringe legal theories suddenly make their way into Supreme Court decisions.

        --
        "Think of how stupid the average person is. Then realize half of 'em are stupider than that." - George Carlin
      • (Score: 4, Informative) by ls671 on Thursday August 01 2024, @01:57PM

        by ls671 (891) on Thursday August 01 2024, @01:57PM (#1366604) Homepage

        They can slow down some sites too. I just return a 403

        I keep adding to the list and mod_security already blocks some by default.

        always blocked:
        BOT/0.1 (BOT for JCE)
        BorneoBot/0.5.0
        Seekport Crawler
        SeznamBot/3.2
        webgains-bot
        coccocbot
        nbertaupete
        MojeekBot
        DF Bot
        PetalBot
        gdnplus.com
        Translation-Search-Machine
        dataforseo.com
        dataforseo-bot
        is.gd/hmbg1a
        cincrawdata.net
        Cincraw
        www.qwant.com
        Baispider
        bai.com
        SEOkicks
        seokicks.de
        SurdotlyBot
        Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36
        BLEXBot
        SMTBot
        about.censys.io
        CensysInspect
        seekport.com
        webmeup-crawler
        lua-resty-http
        serpstatbot.com
        ALittle Client
        VelenPublicWebCrawler
        Amazonbot
        megaindex.com
        spider-feedback
        awario.com
        DotBot
        webprosbot
        test-bot
        PubMatic Crawler Bot
        PetalBot
        Go-http-client
        Bytespider
        Timpibot
        ClaudeBot
        FriendlyCrawler
        ImagesiftBot
        getodin.com
        BitSightBot
        GPTBot
        openai.com
        Custom-AsyncHttpClient

        ---------------------------------------------------------------------------

        Only allowed between 12 and 7 AM:
        Googlebot
        zoominfobot
        Applebot
        bingbot
        Adsbot
        MauiBot
        dotbot
        Googlebot-Image
        YandexBot
        Bytespider
        evc-batch
        The Knowledge AI
        YandexImages
        Barkrowler
        DuckDuckBot
        Facebot
        facebookexternalhit
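The two lists above boil down to a small decision rule. The sketch below is not the poster's actual Apache/mod_security configuration. It is a plain-Python illustration using a handful of names from the lists, and it rests on three assumptions: matching is a case-insensitive substring test on the User-Agent header, "between 12 and 7 AM" means midnight to 07:00 server time, and night-only bots get a 403 outside that window.

```python
from datetime import time

# A few entries copied from the lists above; a real deployment would
# carry the full lists.
ALWAYS_BLOCKED = ("GPTBot", "ClaudeBot", "Bytespider", "Amazonbot", "PetalBot")
NIGHT_ONLY = ("Googlebot", "bingbot", "Applebot", "DuckDuckBot", "YandexBot")

# Assumed reading of the allowed window: midnight up to, not including, 07:00.
NIGHT_START, NIGHT_END = time(0, 0), time(7, 0)


def status_for(user_agent, now):
    """Return the HTTP status to send for this User-Agent at clock time `now`."""
    ua = user_agent.lower()
    # The always-blocked list is checked first, so a name that appears in
    # both lists (Bytespider does above) is always refused.
    if any(token.lower() in ua for token in ALWAYS_BLOCKED):
        return 403
    if any(token.lower() in ua for token in NIGHT_ONLY):
        return 200 if NIGHT_START <= now < NIGHT_END else 403
    return 200


if __name__ == "__main__":
    print(status_for("Mozilla/5.0 (compatible; GPTBot/1.0)", time(3, 0)))
    print(status_for("Mozilla/5.0 (compatible; Googlebot/2.1)", time(3, 0)))
    print(status_for("Mozilla/5.0 (compatible; Googlebot/2.1)", time(14, 0)))
```

User-Agent strings are trivially spoofed, so a filter like this only stops crawlers that identify themselves honestly.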

        --
        Everything I write is lies, including this sentence.
  • (Score: 2) by drussell on Thursday August 01 2024, @03:09AM (10 children)

    by drussell (2678) Subscriber Badge on Thursday August 01 2024, @03:09AM (#1366555) Journal

    > One crawler downloaded 73 TB of zipped HTML files in May 2024, with almost 10 TB in a single day. This cost us over $5,000 in bandwidth charges

    73TB of data transfer cost them $5000?

    Perhaps they need to upgrade their cell phone data plan or whatever they were using for connectivity. ;-)

    To me, that still doesn't seem to add up on any kind of commercial, wholesale data connection...

    Don't get me wrong, that is an insane amount of data and means it is an average usage of what, 25 MB/sec of constant data over the month, but that still shouldn't cost $5000.

    Isn't an un-metered 1Gbps commercial connection typically a few hundred per month in most places now? I suppose in some places you might still pay a hefty premium for upstream bandwidth?
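For anyone who wants to check the back-of-the-envelope figures, here is the arithmetic spelled out. It assumes decimal units (1 TB = 10^12 bytes), a 31-day May, and exactly $5,000 as the bill, although the article says "over $5,000".

```python
TB = 10**12                       # decimal terabyte, in bytes (assumption)
SECONDS_IN_MAY = 31 * 24 * 60 * 60

total_bytes = 73 * TB

# Average rate if the 73 TB were spread evenly over the whole month.
avg_mb_per_s = total_bytes / SECONDS_IN_MAY / 10**6
avg_mbit_per_s = avg_mb_per_s * 8

# The "almost 10 TB in a single day" peak, taken as a full 10 TB.
peak_day_mb_per_s = 10 * TB / (24 * 60 * 60) / 10**6

# Implied price per (decimal) gigabyte at exactly $5,000.
cost_per_gb = 5000 / (73 * 1000)

if __name__ == "__main__":
    print(f"average: {avg_mb_per_s:.1f} MB/s ({avg_mbit_per_s:.0f} Mbit/s)")
    print(f"peak day: {peak_day_mb_per_s:.0f} MB/s")
    print(f"implied cost: ${cost_per_gb:.3f} per GB")
```

Under those assumptions the monthly average is about 27 MB/s (roughly 218 Mbit/s) and the peak day is about 116 MB/s, close to saturating a 1 Gbps link. The implied price is a little under 7 cents per GB.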

    • (Score: 5, Interesting) by MostCynical on Thursday August 01 2024, @03:50AM

      by MostCynical (2589) on Thursday August 01 2024, @03:50AM (#1366556) Journal

      https://learnaws.io/aws-calculator/s3 [learnaws.io]

      plugging random numbers..
      How much storage do you need per month? 50 GB per month
      How many times would you upload or list files? 1000
      How many downloads would you perform? 2000
      What'll be your total download size every month? 73 TB per month
      Estimated S3 Standard Cost

      S3 Standard storage cost: $1.15
      S3 Standard PUT requests cost: $0.01
      S3 Standard GET requests cost: $0.00
      S3 Standard data transfer out cost: $6727.68
      Total AWS S3 costs: $6728.84/month
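The calculator's line items can be reproduced with flat per-unit rates. The rates below are assumptions, worked backward from the figures quoted above rather than taken from AWS's price list. Actual pricing may be tiered and may vary by region.

```python
# Assumed flat rates, inferred from the calculator output above.
STORAGE_PER_GB_MONTH = 0.023    # USD per GB stored per month
PUT_PER_1000 = 0.005            # USD per 1,000 PUT/LIST requests
GET_PER_1000 = 0.0004           # USD per 1,000 GET requests
TRANSFER_OUT_PER_GB = 0.09      # USD per GB transferred out
GB_PER_TB = 1024                # binary conversion; it matches the quoted total


def s3_estimate(storage_gb, puts, gets, download_tb):
    """Return the four line items plus their total, in USD (unrounded)."""
    items = {
        "storage": storage_gb * STORAGE_PER_GB_MONTH,
        "put": puts / 1000 * PUT_PER_1000,
        "get": gets / 1000 * GET_PER_1000,
        "transfer_out": download_tb * GB_PER_TB * TRANSFER_OUT_PER_GB,
    }
    items["total"] = sum(items.values())
    return items


if __name__ == "__main__":
    for name, usd in s3_estimate(50, 1000, 2000, 73).items():
        print(f"{name}: ${usd:.2f}")
```

Nearly the entire bill is the data transfer line. Storage and request charges are noise at this volume.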

      --
      "I guess once you start doubting, there's no end to it." -Batou, Ghost in the Shell: Stand Alone Complex
    • (Score: 4, Informative) by dwilson98052 on Thursday August 01 2024, @08:21AM

      by dwilson98052 (17613) on Thursday August 01 2024, @08:21AM (#1366572)

      If you're actually hosting your own hardware and have a leased line or have your equipment in a colo you might be right, but cloud bullshit isn't just somebody else's computer, it's expensive as hell too.

    • (Score: 0) by Anonymous Coward on Thursday August 01 2024, @12:36PM

      by Anonymous Coward on Thursday August 01 2024, @12:36PM (#1366597)

      > Isn't an un-metered 1Gbps commercial typically a few hundred per month in most places now?

      Not any place that Comcast is the only game in town.

      Silicon Valley's East Bay connectivity still sucks.

    • (Score: 3, Interesting) by Freeman on Thursday August 01 2024, @02:45PM (6 children)

      by Freeman (732) on Thursday August 01 2024, @02:45PM (#1366608) Journal

      Try downloading 73TB of data every month and see how long your ISP supports your habit.

      --
      Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
      • (Score: 4, Informative) by janrinok on Thursday August 01 2024, @03:06PM (5 children)

        by janrinok (52) Subscriber Badge on Thursday August 01 2024, @03:06PM (#1366610) Journal

        For some people that is just their pron!

        --
        I am not interested in knowing who people are or where they live. My interest starts and stops at our servers.
        • (Score: 1, Insightful) by Anonymous Coward on Friday August 02 2024, @12:24AM (3 children)

          by Anonymous Coward on Friday August 02 2024, @12:24AM (#1366672)

          They're archivists saving porn for the future generations. 😉

          They can't be watching that much of it:

          4K Ultra HD Blu-ray discs have a maximum video bitrate of 128 Mbps

          70TB/31 days = 209 megabits per second.

          • (Score: 2) by Tork on Friday August 02 2024, @12:35AM (2 children)

            by Tork (3914) Subscriber Badge on Friday August 02 2024, @12:35AM (#1366676)

            > They can't be watching that much of it:

            Umm... yah, think about that for a minute.

            --
            🏳️‍🌈 Proud Ally 🏳️‍🌈
            • (Score: 5, Funny) by janrinok on Friday August 02 2024, @01:12AM (1 child)

              by janrinok (52) Subscriber Badge on Friday August 02 2024, @01:12AM (#1366682) Journal
              Everybody fast-forwards the bit where he actually fixes the washing machine.....
              --
              I am not interested in knowing who people are or where they live. My interest starts and stops at our servers.
              • (Score: 3, Funny) by Tork on Friday August 02 2024, @03:22AM

                by Tork (3914) Subscriber Badge on Friday August 02 2024, @03:22AM (#1366696)
                Yah, there's a great tutorial on laying pipe!
                --
                🏳️‍🌈 Proud Ally 🏳️‍🌈
        • (Score: 3, Funny) by Tork on Friday August 02 2024, @12:34AM

          by Tork (3914) Subscriber Badge on Friday August 02 2024, @12:34AM (#1366675)

          > For some people that is just their pron!

          pftbt, amateurs.

          --
          🏳️‍🌈 Proud Ally 🏳️‍🌈