
Abusive AI Crawlers Run Up Large Bandwidth Bills for their Targets

Accepted submission by canopic jug at 2024-07-30 07:17:57 from the externalities dept.
Techonomics

An increasing number of sites are reporting significant bandwidth lost to AI crawlers. The documentation-hosting site Read the Docs [readthedocs.com] has published an analysis of the abuse it has seen from AI crawlers [readthedocs.com], including several examples.

We have been seeing a number of bad crawlers over the past few months, but here are a couple illustrative examples of the abuse we're seeing:

73 TB in May 2024 from one crawler

One crawler downloaded 73 TB of zipped HTML files in May 2024, with almost 10 TB in a single day. This cost us over $5,000 in bandwidth charges, and we had to block the crawler. We emailed this company, reporting a bug in their crawler, and we're working with them on reimbursing us for the costs.

[...] This was a bug in their crawler that was causing it to download the same files over and over again. There was no bandwidth limiting in place, nor support for ETags and Last-Modified headers, which would have allowed the crawler to download only files that had changed. We have reported this issue to them, and hopefully the issue will be fixed.
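
To make the ETag/Last-Modified point concrete, here is a minimal sketch of how a crawler could use conditional requests, written with Python's requests library. The cache structure and function name are assumptions for illustration; this is not Read the Docs' or the crawler vendor's actual code.

import requests

# Hypothetical in-memory cache: url -> (etag, last_modified, body)
cache = {}

def polite_fetch(url):
    headers = {}
    cached = cache.get(url)
    if cached:
        etag, last_modified, _ = cached
        if etag:
            headers["If-None-Match"] = etag          # revalidate by ETag
        if last_modified:
            headers["If-Modified-Since"] = last_modified
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304 and cached:
        # Server says the file is unchanged: reuse the cached body,
        # nothing is re-downloaded
        return cached[2]
    resp.raise_for_status()
    cache[url] = (resp.headers.get("ETag"),
                  resp.headers.get("Last-Modified"),
                  resp.content)
    return resp.content

A 304 Not Modified response carries no body, so even this much caching means each unchanged archive is paid for roughly once rather than over and over.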

Many of the bots even ignore the site's robots.txt file.
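
Honoring robots.txt is cheap to implement; Python's standard library ships a parser for it. The fragment below is a sketch only, with a hypothetical host and user-agent string, showing the check a well-behaved crawler would make before each request.

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://docs.example.com/robots.txt")  # hypothetical host
robots.read()

url = "https://docs.example.com/en/latest/offline.zip"  # hypothetical target
if robots.can_fetch("ExampleBot/1.0", url):
    pass  # allowed: go ahead and download
else:
    pass  # disallowed by robots.txt: skip this URL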


Original Submission