SoylentNews
SoylentNews is people
https://soylentnews.org/

Title    Abusive AI Crawlers Run Up Large Bandwidth Bills for their Targets
Date    Wednesday July 31 2024, @09:40PM
Author    Fnord666
from the externalities dept.
https://soylentnews.org/article.pl?sid=24/07/30/1337210

canopic jug writes:

A growing number of sites report losing bandwidth to AI crawlers. The documentation-sharing site Read the Docs has an analysis of the attacks against it by AI crawlers, including several examples.

We have been seeing a number of bad crawlers over the past few months, but here are a couple illustrative examples of the abuse we're seeing:

73 TB in May 2024 from one crawler

One crawler downloaded 73 TB of zipped HTML files in May 2024, with almost 10 TB in a single day. This cost us over $5,000 in bandwidth charges, and we had to block the crawler. We emailed this company, reporting a bug in their crawler, and we're working with them on reimbursing us for the costs.

[...] This was a bug in their crawler that was causing it to download the same files over and over again. There was no bandwidth limiting in place, or support for Etags and Last-Modified headers which would have allowed the crawler to only download files that had changed. We have reported this issue to them, and hopefully the issue will be fixed.
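For readers unfamiliar with the headers mentioned in the excerpt, here is a minimal sketch of a conditional download using only the Python standard library. It is not the crawler's code or anything published by Read the Docs; the names `CachedCopy`, `conditional_headers`, and `fetch` are hypothetical. The idea is to send back the `ETag` and `Last-Modified` values from the previous download so the server can answer `304 Not Modified` with no body instead of resending the file.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedCopy:
    """Body and validators remembered from a previous download (hypothetical type)."""
    body: bytes
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(cached: Optional[CachedCopy]) -> dict:
    """Build the request headers that make a GET conditional."""
    headers = {}
    if cached is not None:
        if cached.etag:
            # Server compares this against the current ETag of the resource.
            headers["If-None-Match"] = cached.etag
        if cached.last_modified:
            # Fallback validator: only send the body if changed since this date.
            headers["If-Modified-Since"] = cached.last_modified
    return headers


def fetch(url: str, cached: Optional[CachedCopy] = None) -> CachedCopy:
    """Download url, reusing the cached copy when the server replies 304."""
    request = urllib.request.Request(url, headers=conditional_headers(cached))
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return CachedCopy(
                body=response.read(),
                etag=response.headers.get("ETag"),
                last_modified=response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError; nothing was re-sent.
        if err.code == 304 and cached is not None:
            return cached
        raise
```

A crawler built this way would pass the result of one `fetch` call as the `cached` argument of the next, so an unchanged file costs a few hundred bytes of headers rather than a full download.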

Many of the bots even ignore the robots.txt file and its contents.
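As an illustration of what honoring robots.txt looks like, here is a small sketch using Python's standard `urllib.robotparser` module. The robots.txt contents, the `ExampleBot` user agent, and the `docs.example.org` URLs are made up for the example and do not come from Read the Docs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved crawler checks every URL before requesting it.
allowed = parser.can_fetch("ExampleBot", "https://docs.example.org/en/latest/")
blocked = parser.can_fetch("ExampleBot", "https://docs.example.org/private/x.zip")

# Seconds to wait between requests, or None if the site sets no delay.
delay = parser.crawl_delay("ExampleBot")

print(allowed, blocked, delay)
```

Against a live site, a crawler would call `set_url()` and `read()` on the parser instead of `parse()`. Note that `Crawl-delay` is a widely used but non-standard directive, so not every site sets it.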


Original Submission

Links

  1. "canopic jug" - https://soylentnews.org/~canopic+jug/
  2. "Read the Docs" - https://about.readthedocs.com/
  3. "analysis of the attacks against it by AI crawlers" - https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/
  4. "Original Submission" - https://soylentnews.org/submit.pl?op=viewsub&subid=63387

© Copyright 2025 - SoylentNews, All Rights Reserved
