Over recent weeks we have been experiencing connections from a large number of bots, spiders and scrapers. Some are the expected ones (Microsoft, Google, Amazon, etc.) and these tend to rate-limit their requests and cause us little trouble.
Others appear to be AI-driven scrapers, and they can tie up a large percentage of the site's resources. For the most part they ignore both robots.txt and the 429 (Too Many Requests) responses we return. While individually they are only an annoyance, many bots querying the site at the same time can slow the speed at which the site responds to members' attempts to view a page or leave a comment. They have contributed to some of the 404 and 503 (Backend Fetch Failed) errors that you might have experienced recently.
Software has been developed to block such abusive clients for a short period. In the majority of cases this will be invisible to you as users, other than, hopefully, improving the responsiveness of the site.
However, it is possible that there will occasionally be a false positive and you may have difficulty connecting to the site. If you do experience connection problems, please inform us immediately, either by email or on IRC. Neither of those applies filters to connections; the short temporary blocks apply only to the site itself. We will then contact you by email to ascertain your IP address so that we can lift any block that was incorrectly applied. Please do not publish an IP address in a comment or on IRC.
If you are using a VPN or Tor, it might be advisable to try another route to circumvent any temporary block that might be affecting your connection.
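The temporary-block approach described above can be sketched in a few lines. This is a minimal illustration, not the site's actual software, and the window, threshold and block-duration values are assumptions: count each client's requests over a sliding window, and when a client exceeds the threshold, refuse it for a short period, after which the block expires on its own.

```python
# Minimal sketch of short temporary blocks for abusive clients.
# WINDOW, THRESHOLD and BLOCK_FOR are illustrative values, not the
# site's real configuration.
import time
from collections import defaultdict, deque

WINDOW = 60        # seconds over which requests are counted (assumed)
THRESHOLD = 100    # max requests per window before blocking (assumed)
BLOCK_FOR = 300    # length of the temporary block, in seconds (assumed)

hits = defaultdict(deque)   # ip -> timestamps of recent requests
blocked_until = {}          # ip -> time at which its block expires

def allow(ip, now=None):
    """Return True if the request should be served, False if blocked."""
    now = time.monotonic() if now is None else now
    if blocked_until.get(ip, 0) > now:
        return False                      # still inside the temporary block
    q = hits[ip]
    q.append(now)
    while q and q[0] <= now - WINDOW:     # drop timestamps outside the window
        q.popleft()
    if len(q) > THRESHOLD:
        blocked_until[ip] = now + BLOCK_FOR
        q.clear()
        return False
    return True
```

Because the block is keyed on time rather than on a stored flag, it lifts itself automatically once `BLOCK_FOR` seconds have passed, which matches the "short period" behaviour described above.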
(Score: 3, Disagree) by janrinok on Thursday July 24, @02:39PM (2 children)
We will have to look more closely at Anubis. It does seem like a good solution, provided that the community is happy to have it. Thank you.
In my quick reading of the content at the link you gave, it seems to me that it relies on JavaScript. I could be wrong, but that is my initial impression.
(Score: 4, Informative) by fab23 on Thursday July 24, @08:06PM
Another thing I have done on some of my sites is to add the following at the end of my existing robots.txt. It at least stops the AI crawlers that honor it, while still allowing the search-engine crawlers from the same companies (e.g. Apple or Google):
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: ClaudeBot
User-agent: Diffbot
User-agent: FacebookBot
User-agent: Google-Extended
User-agent: GPTBot
User-agent: Omgili
Disallow: /
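The snippet above relies on consecutive `User-agent` lines forming one group that shares the single `Disallow: /` rule. You can check that interpretation with Python's standard-library `urllib.robotparser`; the agent names come from the list above, while the checked paths are just illustrative:

```python
# Verify how the robots.txt fragment above is interpreted:
# the grouped User-agent lines all share "Disallow: /".
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: ClaudeBot
User-agent: Diffbot
User-agent: FacebookBot
User-agent: Google-Extended
User-agent: GPTBot
User-agent: Omgili
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("GPTBot", "/article/123"))     # AI crawler: blocked
print(rp.can_fetch("Googlebot", "/article/123"))  # search crawler: still allowed
```

Since `Googlebot` matches none of the listed agents and there is no `User-agent: *` group, the ordinary search crawlers remain unaffected, which is the stated goal.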
(Score: 2) by wirelessduck on Monday July 28, @05:05AM
There is also a list of AI user agents for blocking via robots.txt.
https://github.com/ai-robots-txt/ai.robots.txt [github.com]