A growing number of sites report losing significant bandwidth to AI crawlers. The documentation hosting site Read the Docs has published an analysis of the abuse it has seen from AI crawlers, including several examples.
We have been seeing a number of bad crawlers over the past few months, but here are a couple illustrative examples of the abuse we're seeing:
73 TB in May 2024 from one crawler
One crawler downloaded 73 TB of zipped HTML files in May 2024, with almost 10 TB in a single day. This cost us over $5,000 in bandwidth charges, and we had to block the crawler. We emailed this company, reporting a bug in their crawler, and we're working with them on reimbursing us for the costs.
[...] This was a bug in their crawler that was causing it to download the same files over and over again. There was no bandwidth limiting in place, or support for Etags and Last-Modified headers which would have allowed the crawler to only download files that had changed. We have reported this issue to them, and hopefully the issue will be fixed.
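For readers unfamiliar with the headers mentioned in the quote, here is a minimal sketch of a conditional HTTP request in Python. It is not the crawler's code or anything Read the Docs published. The function names `conditional_headers` and `fetch_if_changed` are hypothetical, and the sketch assumes the crawler stored the `ETag` and `Last-Modified` values from its previous download of the same URL.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers that ask the server to skip unchanged content."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None if nothing changed."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        # urllib reports "304 Not Modified" as an HTTPError; the server sent
        # no body, so the cached copy is still valid.
        if err.code == 304:
            return None, etag, last_modified
        raise
```

With validators like these in place, a repeat visit to an unchanged file costs a few hundred bytes of headers instead of a full download.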
Many of the bots even ignore the robots.txt file and its contents.
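As a rough illustration of what honoring robots.txt involves, the sketch below uses Python's standard `urllib.robotparser` to check whether a given user agent may fetch a URL. The robots.txt content, the bot name `ExampleBot`, and the helper `allowed` are made up for the example and do not come from any site mentioned here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is shut out, everyone else is allowed.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(EXAMPLE_ROBOTS_TXT, "ExampleBot/1.0", "https://example.org/docs/"))
print(allowed(EXAMPLE_ROBOTS_TXT, "SomeOtherBot", "https://example.org/docs/"))
```

A well-behaved crawler runs a check like this before every request; the complaint above is that many bots never do.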
(Score: 4, Funny) by Tork on Wednesday July 31 2024, @11:01PM
Holy shit! That's so unlike them!! 🙄
🏳️🌈 Proud Ally 🏳️🌈
(Score: 4, Touché) by c0lo on Wednesday July 31 2024, @11:42PM (3 children)
Naturally stupid company builds corpus to train artificial intelligence. What can go wrong?
https://www.youtube.com/@ProfSteveKeen https://soylentnews.org/~MichaelDavidCrawford
(Score: 1, Funny) by Anonymous Coward on Thursday August 01 2024, @05:51AM (2 children)
(Score: 4, Touché) by Thexalon on Thursday August 01 2024, @10:50AM
Why bother with subtlety, when odds are pretty good that they'll gladly slurp up complete nonsense? All you need to do is have total nonsense appear in enough places, and a lot of machines (and people) start taking it seriously.
If you don't think that works, watch the process of fringe legal theories suddenly make their way into Supreme Court decisions.
"Think of how stupid the average person is. Then realize half of 'em are stupider than that." - George Carlin
(Score: 4, Informative) by ls671 on Thursday August 01 2024, @01:57PM
They can slow down some sites too. I just return a 403.
I keep adding to the list and mod_security already blocks some by default.
always blocked:
BOT/0.1 (BOT for JCE)
BorneoBot/0.5.0
Seekport Crawler
SeznamBot/3.2
webgains-bot
coccocbot
nbertaupete
MojeekBot
DF Bot
PetalBot
gdnplus.com
Translation-Search-Machine
dataforseo.com
dataforseo-bot
is.gd/hmbg1a
cincrawdata.net
Cincraw
www.qwant.com
Baispider
bai.com
SEOkicks
seokicks.de
SurdotlyBot
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36
BLEXBot
SMTBot
about.censys.io
CensysInspect
seekport.com
webmeup-crawler
lua-resty-http
serpstatbot.com
ALittle Client
VelenPublicWebCrawler
Amazonbot
megaindex.com
spider-feedback
awario.com
DotBot
webprosbot
test-bot
PubMatic Crawler Bot
PetalBot
Go-http-client
Bytespider
Timpibot
ClaudeBot
FriendlyCrawler
ImagesiftBot
getodin.com
BitSightBot
GPTBot
openai.com
Custom-AsyncHttpClient
---------------------------------------------------------------------------
Only allowed between 12 and 7 AM:
Googlebot
zoominfobot
Applebot
bingbot
Adsbot
MauiBot
dotbot
Googlebot-Image
YandexBot
Bytespider
evc-batch
The Knowledge AI
YandexImages
Barkrowler
DuckDuckBot
Facebot
facebookexternalhit
Everything I write is lies, including this sentence.
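The comment above does not show the actual mod_security rules, so the following is only a sketch of the same idea in Python: always answer some user agents with 403, and let others in only during a night window. Several things here are assumptions: the case-insensitive substring matching, reading "12" as midnight, letting the always-blocked list win when a name appears in both lists, and the function name `status_for`. Only a few entries from each list are included.

```python
from datetime import time

# Short excerpts of the two lists from the comment; the real lists are longer.
ALWAYS_BLOCKED = ("GPTBot", "ClaudeBot", "Bytespider", "Amazonbot", "PetalBot")
NIGHT_ONLY = ("Googlebot", "bingbot", "Applebot", "DuckDuckBot", "YandexBot")

NIGHT_START = time(0, 0)  # assumption: "12" means midnight
NIGHT_END = time(7, 0)


def status_for(user_agent: str, now: time) -> int:
    """Return the HTTP status to send for this user agent at this time of day."""
    ua = user_agent.lower()
    # Assumption: the always-blocked list takes precedence over the night list.
    if any(token.lower() in ua for token in ALWAYS_BLOCKED):
        return 403
    if any(token.lower() in ua for token in NIGHT_ONLY):
        return 200 if NIGHT_START <= now < NIGHT_END else 403
    return 200
```

For example, `status_for("Mozilla/5.0 (compatible; Googlebot/2.1)", time(3, 0))` yields 200, while the same user agent at noon yields 403. User-agent strings are trivially spoofed, so this only stops crawlers that identify themselves honestly.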
(Score: 2) by drussell on Thursday August 01 2024, @03:09AM (10 children)
73TB of data transfer cost them $5000?
Perhaps they need to upgrade their cell phone data plan or whatever they were using for connectivity. ;-)
To me, that still doesn't seem to add up on any kind of commercial, wholesale data connection...
Don't get me wrong, that is an insane amount of data: it works out to an average of roughly 27 MB/sec sustained over the whole month. But that still shouldn't cost $5000.
Isn't an unmetered 1 Gbps commercial connection typically a few hundred per month in most places now? I suppose in some places you might still pay a hefty premium for upstream bandwidth?
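The back-of-the-envelope numbers in this comment can be checked with a few lines of Python. This is only a sketch: it assumes decimal units (1 TB = 10^12 bytes), a 31-day month, and perfectly even traffic, and the function name `average_rate` is made up for the example.

```python
SECONDS_PER_DAY = 86_400


def average_rate(total_bytes: float, days: int) -> tuple[float, float]:
    """Return the average rate as (megabytes per second, megabits per second)."""
    bytes_per_second = total_bytes / (days * SECONDS_PER_DAY)
    return bytes_per_second / 1e6, bytes_per_second * 8 / 1e6


mb_per_s, mbit_per_s = average_rate(73e12, 31)
print(f"{mb_per_s:.2f} MB/s, {mbit_per_s:.0f} Mbit/s")  # about 27.25 MB/s, 218 Mbit/s

# What a fully used 1 Gbit/s line could move in the same month, in TB.
line_capacity_tb = 1e9 / 8 * 31 * SECONDS_PER_DAY / 1e12
print(f"{line_capacity_tb:.1f} TB")  # about 334.8 TB
```

Under these assumptions, 73 TB in a month is a bit over a fifth of what a saturated 1 Gbit/s link could carry, which is why the pricing model matters far more than raw capacity.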
(Score: 5, Interesting) by MostCynical on Thursday August 01 2024, @03:50AM
https://learnaws.io/aws-calculator/s3
Plugging in random numbers:
How much storage do you need per month? 50 GB per month
How many times would you upload or list files? 1000
How many downloads would you perform? 2000
What'll be your total download size every month? 73 TB per month
Estimated S3 Standard Cost
S3 Standard storage cost: $1.15
S3 Standard PUT requests cost: $0.01
S3 Standard GET requests cost: $0.00
S3 Standard data transfer out cost: $6727.68
Total AWS S3 costs: $6728.84/month
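The calculator's figures above can be reproduced with simple arithmetic. In the sketch below, the per-unit rates are assumptions chosen because they match the output shown (a flat rate per GB transferred, with 1 TB counted as 1024 GB); they are not taken from an AWS price list, and real S3 billing may use tiers or different regional prices. The function name `s3_estimate` is hypothetical.

```python
GB_PER_TB = 1024  # assumption: the calculator counts 1 TB as 1024 GB

# Assumed rates in USD, picked to match the calculator output above.
STORAGE_PER_GB_MONTH = 0.023
PUT_PER_1000_REQUESTS = 0.005
GET_PER_1000_REQUESTS = 0.0004
TRANSFER_OUT_PER_GB = 0.09


def s3_estimate(storage_gb, put_requests, get_requests, transfer_out_tb):
    """Return a dict of monthly cost components plus their total, in USD."""
    costs = {
        "storage": storage_gb * STORAGE_PER_GB_MONTH,
        "put": put_requests / 1000 * PUT_PER_1000_REQUESTS,
        "get": get_requests / 1000 * GET_PER_1000_REQUESTS,
        "transfer_out": transfer_out_tb * GB_PER_TB * TRANSFER_OUT_PER_GB,
    }
    costs["total"] = sum(costs.values())
    return costs


estimate = s3_estimate(
    storage_gb=50, put_requests=1000, get_requests=2000, transfer_out_tb=73
)
for name, cost in estimate.items():
    print(f"{name}: ${cost:,.2f}")
```

Nearly the whole bill is the data transfer line: storage and request charges together come to little more than a dollar.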
"I guess once you start doubting, there's no end to it." -Batou, Ghost in the Shell: Stand Alone Complex
(Score: 4, Informative) by dwilson98052 on Thursday August 01 2024, @08:21AM
If you're actually hosting your own hardware and have a leased line, or have your equipment in a colo, you might be right, but cloud bullshit isn't just somebody else's computer, it's expensive as hell too.
(Score: 0) by Anonymous Coward on Thursday August 01 2024, @12:36PM
> Isn't an un-metered 1Gbps commercial typically a few hundred per month in most places now?
Not anywhere that Comcast is the only game in town.
Silicon Valley's East Bay connectivity still sucks.
(Score: 3, Interesting) by Freeman on Thursday August 01 2024, @02:45PM (6 children)
Try downloading 73TB of data every month and see how long your ISP supports your habit.
Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
(Score: 4, Informative) by janrinok on Thursday August 01 2024, @03:06PM (5 children)
For some people that is just their pron!
I am not interested in knowing who people are or where they live. My interest starts and stops at our servers.
(Score: 1, Insightful) by Anonymous Coward on Friday August 02 2024, @12:24AM (3 children)
They're archivists saving porn for the future generations. 😉
They can't be watching that much of it:
70TB/31 days = 209 megabits per second.
(Score: 2) by Tork on Friday August 02 2024, @12:35AM (2 children)
Umm... yah, think about that for a minute.
🏳️🌈 Proud Ally 🏳️🌈
(Score: 5, Funny) by janrinok on Friday August 02 2024, @01:12AM (1 child)
I am not interested in knowing who people are or where they live. My interest starts and stops at our servers.
(Score: 3, Funny) by Tork on Friday August 02 2024, @03:22AM
🏳️🌈 Proud Ally 🏳️🌈
(Score: 3, Funny) by Tork on Friday August 02 2024, @12:34AM
pftbt, amateurs.
🏳️🌈 Proud Ally 🏳️🌈