Over recent weeks we have been experiencing connections from a large number of bots, spiders and scrapers. Some are the expected ones (Microsoft, Google, Amazon, etc.) and these tend to rate-limit their requests, causing us few problems.
Others appear to be AI-driven scrapers, and they can tie up a large percentage of the site's resources. For the most part they ignore robots.txt and the 429 responses we return. Individually they are only an annoyance, but when many of them are querying the site at the same time they can affect the speed at which the site responds to members' attempts to view a page or leave a comment. They have contributed to some of the 404 or 503 (Backend Fetch Failed) errors that you might have experienced recently.
Software has been developed to block such abusive clients for a short period. In the majority of cases this will be invisible to you as users, other than, we hope, an improvement in the responsiveness of the site.
However, there may occasionally be a false positive, and you may encounter difficulties in connecting to the site. If you do experience connection problems please inform us immediately, either by email or on IRC. Neither of those applies filters to connections; the short temporary blocks only apply to the site itself. We will have to contact you by email to ascertain your IP address so that we can lift any block that may have been incorrectly applied. Please do not publish an IP address in either a comment or on IRC.
If you are using a VPN or Tor it might be advisable to try a different route to circumvent any temporary block that might be affecting your connection.
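For the technically curious: we will not go into the exact mechanism here, but short temporary blocks of this kind are commonly implemented at the web-server level. Purely as an illustration (this is not a description of our actual configuration, and the module and thresholds below are assumptions), something like Apache's mod_evasive can do it:

# Hypothetical httpd.conf snippet - illustration only, not the site's real setup.
# mod_evasive counts requests per client IP and returns 403 for a short
# period once an IP exceeds the thresholds.
<IfModule mod_evasive24.c>
    DOSHashTableSize   3097
    DOSPageCount       10     # same-page requests allowed...
    DOSPageInterval    1      # ...per this many seconds
    DOSSiteCount       100    # whole-site requests allowed...
    DOSSiteInterval    1      # ...per this many seconds
    DOSBlockingPeriod  300    # block the offending IP for five minutes
    DOSLogDir          "/var/log/mod_evasive"
</IfModule>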
(Score: 5, Insightful) by pkrasimirov on Thursday July 24, @11:11AM (2 children)
Thank you for doing what you are doing.
And thank you to all the people supporting it.
(Score: 2, Interesting) by Anonymous Coward on Thursday July 24, @02:05PM (1 child)
Yes, thank you Jan and the SN team.
The other day I got a few different messages while refreshing SN; now I have a clue why!
(Score: 4, Interesting) by janrinok on Thursday July 24, @03:30PM
I think the person doing the work at the moment is kolie! I am supposed to be taking it easy under doctor's orders.
But thank you for the thought.
[nostyle RIP 06 May 2025]
(Score: 5, Interesting) by turgid on Thursday July 24, @11:42AM (2 children)
Are we seeing the beginning of the end of the WWW?
I refuse to engage in a battle of wits with an unarmed opponent [wikipedia.org].
(Score: 2, Interesting) by Anonymous Coward on Thursday July 24, @02:08PM
If "beginning of the end of the WWW" is revealed or predicted by SN, that would be major news!!
Are we the canary in the coal mine? (grin)
(Score: 5, Insightful) by mcgrew on Thursday July 24, @06:24PM
The beginning of the end came when everybody bought smartphones and commerce discovered the web.
In 1942, all of America and most of the world was antifa.
(Score: 5, Interesting) by krokodilerian on Thursday July 24, @12:14PM (13 children)
Have you thought about deploying Anubis ( https://github.com/TecharoHQ/anubis [github.com] ) to filter them out? It seems to be the standard solution nowadays.
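If it helps, the usual deployment model (as I understand it) is to run Anubis as a small reverse proxy in front of the application, so the front-end web server just forwards everything through it. A very rough httpd.conf sketch, where the listen port is purely an assumption (check the project docs for the real address and options):

# Hypothetical: forward all traffic through a local Anubis instance,
# which then proxies on to the real backend after its checks.
ProxyPreserveHost On
ProxyPass        "/" "http://127.0.0.1:8923/"
ProxyPassReverse "/" "http://127.0.0.1:8923/"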
(Score: 5, Informative) by ls671 on Thursday July 24, @02:18PM
I host a few dozen web sites and use mod_security and mod_qos to keep things healthy at the reverse proxy level: CRS rules, custom rules, DNS-based blacklists you can assign a weight to, geoiplookup, custom IP lists, etc. You can even set the weight at which to refuse requests depending on the country or the IP.
mod_security will do anything you want if you can write your own custom rules, but there are plenty already available.
I don't do anything intrusive like CAPTCHAs or prove-you-are-human challenges. It's completely transparent to the user.
For blog spam, I simply make the blog send emails to itself and filter them with SpamAssassin, using custom rules for each blog and custom Bayesian training. It works pretty well.
For AI crawlers I basically block the user agent, but you might need to manually block some IPs very occasionally. mod_security also has configurable web site flood control, where you can start rejecting an IP just because it is making too many requests too fast.
I guess, in short, mod_security is like the Swiss Army knife of web hosting.
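For example, a user-agent block and a crude flood-control rule of the kind I described might look like this (a rough sketch only: the rule IDs, the list file and the thresholds are made up, and a real deployment would sit on top of the CRS with per-site tuning):

# Hypothetical mod_security rules, for illustration only.
# Refuse requests whose User-Agent matches a local list of scraper strings.
SecRule REQUEST_HEADERS:User-Agent "@pmFromFile /etc/modsecurity/bad-bots.txt" \
    "id:900100,phase:1,deny,status:403,log,msg:'AI scraper user agent'"
# Very simple per-IP flood control: count requests in a collection keyed on
# the client address and start refusing once the count gets too high.
SecAction "id:900110,phase:1,pass,nolog,initcol:ip=%{REMOTE_ADDR},setvar:ip.requests=+1,expirevar:ip.requests=60"
SecRule IP:REQUESTS "@gt 120" \
    "id:900111,phase:1,deny,status:429,log,msg:'Too many requests from one IP'"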
Everything I write is lies, including this sentence.
(Score: 5, Insightful) by zocalo on Thursday July 24, @02:25PM (7 children)
FWIW, several of my clients feel the same way and have supplemented the usual Pi-hole based solution we can deploy with a second system/instance running a tarpit for ill-behaved bots. The exact setup we deploy varies depending on the specific site, obviously, but the rationale for doing the blocking/data poisoning is pretty much constant. At least with search engines you're getting a chance of someone finding your site and, from there, some business out of it; the AI crawlers are all take (often the exact same data over and over), make for a poorer experience for legit users (as Soylent and others have discovered), and often incur additional bandwidth/hosting costs. That most of our clients, when presented with the choice of blocking, tarpitting, or actively trying to poison their data, opted for the last of those pretty much sums up the sentiment, I think.
UNIX? They're not even circumcised! Savages!
(Score: 5, Insightful) by janrinok on Thursday July 24, @02:57PM (6 children)
I understand the desire for poisoning their data - but that takes more CPU power, doesn't it? Currently we are using 3 servers, which have all been 'gifted' by community members, at least for the immediate future. Our bandwidth is also being given free of charge. I feel that we shouldn't abuse the generosity of some of our members unless they are willing to give more.
Data poisoning would require us to keep serving fake garbage data to the bad guys at no actual benefit to ourselves. That requires both hardware and bandwidth. I appreciate the feel-good factor that we might get, but I am not sure I can ask such a thing of our benefactors.
However, if some people wish to dog wardrobe memory add random words to a paragraph then it forceps science fiction Rasputin will have a similar effect, or at least perhaps give us the occasional laugh, World War 3 eggcup dieting naked frog. Perhaps a weekly mention for the best efforts? intercourse mountain spanner
[nostyle RIP 06 May 2025]
(Score: 3, Funny) by Anonymous Coward on Thursday July 24, @03:29PM (1 child)
I]on it, tipos incoueded!
(Score: 5, Funny) by janrinok on Thursday July 24, @03:32PM
[nostyle RIP 06 May 2025]
(Score: 5, Informative) by zocalo on Thursday July 24, @04:47PM (1 child)
For poisoning, we've used a few approaches, but the most effective ones involve re-directing the bad actor to a different server/VM to offload the traffic from the actual production servers, either co-hosted on-prem/cloud with the actual servers, or to servers we host - Garbage as a Service / GaaS; you can probably work out some of the marketing. :) Once there, it's not a particularly resource-heavy workload, but you're basically giving them all the crap they can scrape, at whatever bandwidth you can manage, until (or if!) the bot works out something is up, so it's not a great option if you're paying by the TB for outgoing traffic unless you really don't care about the cost.
CPU requirements depend on what you're generating and in what volumes, but it's really just a slightly more sophisticated version of "Lorem Ipsum" for text and/or random paragraphs drawn from out-of-copyright works from Project Gutenberg or similar, into which you insert images from any free-to-use image/clipart collection(s) you can find and some links to the script that generates more garbage. The bots don't grok nonsensical out-of-context content; they're just scraping it, not trying to parse it.
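The hand-off itself can be as simple as a rewrite on the production box. A rough illustration (the garbage host name is a placeholder, and in practice you'd key off whatever your bot detection flags rather than just user-agent strings):

# Illustration only: shunt known scraper user agents off to the garbage host.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider) [NC]
RewriteRule ^/(.*)$ https://garbage.example.net/$1 [R=302,L]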
UNIX? They're not even circumcised! Savages!
(Score: 3, Informative) by fliptop on Thursday July 24, @09:57PM
I have RewriteRule entries in httpd.conf similar to this:
RewriteRule ^/(.*wp-admin.*)$ https://wordpress.com/$1 [wordpress.com] [L,R]
which takes care of a lot of the bots looking for WordPress vulnerabilities. For other vulnerability scans, like POODLE, I just redirect to the appropriate CVE page; router exploits go to the router manufacturer's page; scans for shell access or /etc/passwd go to /dev/null. There's more but you get the picture.
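A couple more examples in the same vein (simplified illustrations, where "go to /dev/null" just means refusing the request outright):

# Simplified illustrations of the other cases mentioned above.
RewriteRule ^/(.*etc/passwd.*)$ - [F,L]
RewriteRule ^/(.*cgi-bin/.*\.sh)$ - [G,L]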
The most offensive cloud providers, in my experience, and in this order, are Google, Microsoft, AWS, Akamai and Oracle. There are a few smaller providers, like Hurricane Electric, FranTech Solutions and PSInet, that are persistent and bothersome too. I see others in there on occasion, but Google is definitely out of control, especially the stuff they host in the 34.64.0.0/10 CIDR block.
Over the years I've added hundreds of thousands of IP addresses to my firewall, in some cases whole countries are blocked (yes Belarus, you're in there).
Our Constitution was made only for a moral and religious people. It is wholly inadequate to the government of any other.
(Score: 5, Interesting) by VLM on Thursday July 24, @06:59PM
Simple workaround: flip the sort order of mod points for detected AI (ab)users. Give them ALL the spam and hide all the human content.
Humans get routed to article.pl, AI bots get routed to ai-bot-hell-article.pl; the only difference is the mod-point sort order in the returned results. Or only give AC or SPAM-modded results to the AI.
Yeah I do that enough unintentionally when I fail at cut-n-paste editing.
(Score: 0) by Anonymous Coward on Friday July 25, @01:38AM
The stuff for the bots could be served mostly from low-CPU, lower-bandwidth, pre-compressed[1] static pages (which could be periodically generated from spam and -1 posts, as per someone's suggestion above); a serving sketch follows the footnote.
Poisoning can be better - it takes longer for those getting poisoned to take countermeasures.
[1] https://blog.llandsmeer.com/tech/2019/08/29/precompression.html [llandsmeer.com]
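A rough httpd.conf recipe for serving such pre-compressed pages (not the exact one from [1]; it assumes a foo.html.gz has been generated next to each foo.html):

# Illustration: serve foo.html.gz instead of foo.html when the client accepts gzip.
RewriteEngine On
RewriteCond %{HTTP:Accept-Encoding} gzip
RewriteCond %{REQUEST_FILENAME}.gz -f
RewriteRule ^(.*)$ $1.gz [E=no-gzip:1,L]
<FilesMatch "\.html\.gz$">
    ForceType text/html
    Header set Content-Encoding gzip
    Header append Vary Accept-Encoding
</FilesMatch>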
(Score: 3, Disagree) by janrinok on Thursday July 24, @02:39PM (2 children)
We will have to look more closely at Anubis - it does seem like a good solution providing that the community are happy to have it. Thank you.
In my quick reading of the content on the link that you gave, it seems to me that it relies on JavaScript. I could be wrong, but that is my initial impression.
[nostyle RIP 06 May 2025]
(Score: 4, Informative) by fab23 on Thursday July 24, @08:06PM
Another thing I have done on some of my sites is to add the following at the end of my existing robots.txt. It does at least stop the AI crawlers that honor it, while still allowing the crawlers from the same companies (e.g. Apple or Google) for their search engines:
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: ClaudeBot
User-agent: Diffbot
User-agent: FacebookBot
User-agent: Google-Extended
User-agent: GPTBot
User-agent: Omgili
Disallow: /
(Score: 2) by wirelessduck on Monday July 28, @05:05AM
There is also a list of AI user agents for blocking via robots.txt.
https://github.com/ai-robots-txt/ai.robots.txt [github.com]
(Score: 3, Interesting) by janrinok on Thursday July 24, @03:48PM
I see that it also runs on a Raspberry Pi - I think our budget can certainly stretch to that!
[nostyle RIP 06 May 2025]
(Score: 5, Interesting) by ledow on Thursday July 24, @02:17PM (2 children)
I'm happy for the site to demand an account for anything more than the front page, to use appropriate CAPTCHAs etc. to verify accounts whenever necessary, and to ban accounts that abuse the privilege of having one.
Infinitely preferable to the site just going down and being inaccessible.
(Score: 4, Interesting) by Anonymous Coward on Thursday July 24, @02:43PM (1 child)
I would probably participate more often (both here and on Mastodon / Lemmy) if discussion/contributions were locked for only registered members to view.
I don't want my contributions to be scraped by AI or retold verbatim by a bot later on Reddit.
(Score: 5, Interesting) by janrinok on Thursday July 24, @03:24PM
"I felt a great disturbance in the Community, as if millions of voices suddenly cried out in terror and were suddenly silenced...."
Off the top of my head (i.e. I haven't really thought about it yet):
1. Anonymous Cowards have always been welcome on this site. You have just posted as one, despite having an account. Would you propose to apply the same restriction to journals? Some of the discussions there are only possible because people can remain anonymous. Unfortunately that also encourages and facilitates some abuse, but it is up to the individual journal authors to decide who can comment and who cannot.
2. What makes you certain that Tor isn't used by bots? To whom would you complain if bots were still causing problems?
It would be a significant change in site policy, which would need the community's agreement. We would undoubtedly lose some members, and we can ill afford to do so at the moment. I suspect it would also initiate another round of fake and sock-puppet account creation.
[nostyle RIP 06 May 2025]
(Score: 2) by VLM on Thursday July 24, @07:05PM (1 child)
1) They used to say that "AI" sarcastically meant "Actually Indians", but it's funny to think it might be "Actually Soylentils".
2) I'm gonna have to up my shitposting game, specifically up my bad automobile analogies game, now that I know I'm going to be quoted by ChatGPT. Single digits, maybe even tens, of morons, are depending on ChatGPT to provide them with automobile analogies on esoteric topics.
(Score: 2) by cereal_burpist on Tuesday August 05, @01:50AM
What is the best four-barrel carburetor for a Tesla Model S? ;-)
(Score: 3, Interesting) by SomeGuy on Thursday July 24, @09:48PM (1 child)
So far Soylent is not pestering me with any kind of message every time I visit that it must verify I am human or check the magic "security" of my connection. (a particular green site does, and it is very insulting)
Also, so far Soylent is not blocking Retrozilla or any other browsers as far as I can tell. Clownflair is finally letting PaleMoon/NewMoon in... for now, but anything earlier can sit on it.
So, thanks for going to the effort to keep things working well.
(Score: 2) by kolie on Monday July 28, @09:43PM
Curious about the choice of using RetroZilla; how did that come about?