Over recent weeks we have been seeing connections from a large number of bots, spiders and scrapers. Some are the expected ones (Microsoft, Google, Amazon, etc.); these tend to rate-limit their requests and cause us little problem.
Others appear to be AI-driven scrapers, and they can tie up a large percentage of the site's resources. For the most part they ignore robots.txt and carry on regardless when we return a 429 (Too Many Requests) status code. A small number of bots isn't a problem, and individually they are only an annoyance, but when many of them query the site at the same time they can affect the speed at which the site responds to members' attempts to view a page or leave a comment. They have contributed to some of the 404 or 503 (Backend Fetch Failed) errors that you might have experienced recently.
Software has been developed to block such abusive sites for a short period. In the majority of cases this will be invisible to you as users, other than (hopefully) improving the responsiveness of the site.
However, it is possible that there will occasionally be a false positive and you may encounter difficulties in connecting to the site. If you do experience connection problems, please inform us immediately, either by email or on IRC. Neither of those applies filters to connections; the short temporary blocks only apply to the site itself. We will have to contact you by email to ascertain your IP address so that we can lift any block that may have been incorrectly applied. Please do not publish an IP address in either a comment or on IRC.
If you are using a VPN or Tor, it might be advisable to try another route to get around any temporary block that might be affecting your connection.
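The site's actual blocking software is not shown here. Purely as an illustration of the general idea of a short temporary block, the following minimal Python sketch counts requests per client in a sliding window and blocks a client for a fixed period once it exceeds a budget. The class name, the thresholds and the block duration are all assumptions made for the example, not the values the site uses.

```python
import time
from collections import defaultdict, deque


class TemporaryBlocker:
    """Block a client for a short period once it exceeds a request budget.

    All thresholds here are hypothetical, chosen only for illustration.
    """

    def __init__(self, max_requests=60, window_seconds=10.0,
                 block_seconds=300.0, clock=time.monotonic):
        self.max_requests = max_requests      # requests allowed per window
        self.window_seconds = window_seconds  # length of the sliding window
        self.block_seconds = block_seconds    # how long a block lasts
        self.clock = clock                    # injectable for testing
        self._hits = defaultdict(deque)       # client -> recent request times
        self._blocked_until = {}              # client -> time the block expires

    def allow(self, client):
        """Return True if the request should be served, False if blocked."""
        now = self.clock()
        expires = self._blocked_until.get(client)
        if expires is not None:
            if now < expires:
                return False
            del self._blocked_until[client]   # the block has lapsed

        hits = self._hits[client]
        hits.append(now)
        # Drop requests that have fallen out of the sliding window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()

        if len(hits) > self.max_requests:
            # Over budget: start a temporary block and reset the counter.
            self._blocked_until[client] = now + self.block_seconds
            hits.clear()
            return False
        return True
```

A front end would call `allow(ip)` for each request and answer with 429 (or drop the connection) when it returns False. Because the block expires on its own, a false positive clears itself after `block_seconds` even if nobody intervenes.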
(Score: 5, Informative) by zocalo on Thursday July 24, @04:47PM (1 child)
For poisoning, we've used a few approaches, but the most effective ones involve redirecting the bad actor to a different server/VM to offload the traffic from the actual production servers, either co-hosted on-prem/cloud with the actual servers, or on servers we host - Garbage as a Service / GaaS; you can probably work out some of the marketing. :) Once there, it's not a particularly resource-heavy workload, but you're basically giving them all the crap they can scrape, at whatever bandwidth you can manage, until (or if!) the bot works out something is up, so it's not a great option if you're paying by the TB for outgoing traffic unless you really don't care about the cost. CPU resources depend on what you're generating and in what volumes, but it's really just a slightly more sophisticated version of "Lorem Ipsum" for text and/or random paragraphs of text drawn from out-of-copyright works from Project Gutenberg or similar, into which you insert images that can come from any free-to-use image/clipart collection(s) you can find and some links to the script that generates more garbage. The bots don't grok nonsensical out-of-context content; they're just scraping it, not trying to parse it.
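To make the garbage-page idea concrete, here is a rough Python sketch, not our actual generator: it stitches random sentences from a small corpus into paragraphs and appends links that point back at the same generator. The function name, the `/gaas/` URL prefix and the tiny corpus are assumptions for illustration only; a real setup would draw on a much larger body of out-of-copyright text, and image insertion is left out here.

```python
import html
import random

# Placeholder corpus: opening lines from out-of-copyright novels.
# A real deployment would load far more text than this.
CORPUS = [
    "It was the best of times, it was the worst of times.",
    "Call me Ishmael.",
    "It is a truth universally acknowledged, that a single man in "
    "possession of a good fortune, must be in want of a wife.",
    "All happy families are alike; each unhappy family is unhappy "
    "in its own way.",
]


def garbage_page(seed, paragraphs=5, sentences=4, links=3, base="/gaas/"):
    """Build one nonsense HTML page; `base` is a hypothetical URL prefix."""
    # Seeding from the requested page id means the same URL always yields
    # the same page, without storing anything on the server.
    rng = random.Random(seed)
    parts = ["<html><body>"]
    for _ in range(paragraphs):
        text = " ".join(rng.choice(CORPUS) for _ in range(sentences))
        parts.append("<p>" + html.escape(text) + "</p>")
    for _ in range(links):
        # Each link leads to another generated page, so the crawl never ends.
        target = rng.randrange(10**9)
        parts.append('<a href="{}{}">more</a>'.format(base, target))
    parts.append("</body></html>")
    return "\n".join(parts)
```

Calling `garbage_page(12345)` returns the HTML for the page a bot would get at `/gaas/12345`; the numbers in the generated links become the seeds for the next pages it requests.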
UNIX? They're not even circumcised! Savages!
(Score: 3, Informative) by fliptop on Thursday July 24, @09:57PM
I have RewriteRule entries in httpd.conf similar to this:
RewriteRule ^/(.*wp-admin.*)$ https://wordpress.com/$1 [L,R]
which takes care of a lot of the bots looking for WordPress vulnerabilities. For other vulnerability scans, like POODLE, I just redirect to the appropriate CVE page; router exploits go to the router manufacturer's page; scans for shell access or /etc/passwd go to /dev/null. There's more but you get the picture.
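For anyone not fluent in mod_rewrite, the logic of that rule can be sketched in Python as below. This is only an illustration of the pattern match and substitution, not something Apache runs; the `RULES` list and the `redirect_target` function are names made up for the example, and only the wp-admin pattern comes from the rule above.

```python
import re

# (pattern, replacement template) pairs. Only the wp-admin entry mirrors
# the RewriteRule above; further entries would be added the same way.
RULES = [
    (re.compile(r"^/(.*wp-admin.*)$"), r"https://wordpress.com/\1"),
]


def redirect_target(path):
    """Return the URL a request path would be redirected to, or None."""
    for pattern, template in RULES:
        match = pattern.match(path)
        if match:
            # \1 in the template plays the role of $1 in the RewriteRule.
            return match.expand(template)
    return None
```

With this, a probe for `/blog/wp-admin/setup.php` maps to `https://wordpress.com/blog/wp-admin/setup.php`, while an ordinary path such as `/index.html` gives `None` and would be served normally.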
The most offensive cloud providers, in my experience, and in this order, are Google, Microsoft, AWS, Akamai and Oracle. There are a few smaller providers like Hurricane Electric, FranTech Solutions and PSInet that are persistent and bothersome too. I see others in there on occasion but Google is definitely out of control, especially the stuff they host in the 34.64.0.0/10 CIDR.
Over the years I've added hundreds of thousands of IP addresses to my firewall; in some cases whole countries are blocked (yes Belarus, you're in there).
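If you want to check whether an address from your logs falls inside a range like that before adding a firewall rule, Python's standard `ipaddress` module can do it. A minimal sketch follows; the `BLOCKED_NETWORKS` list and `is_blocked` function are illustrative names, and the only range in the list is the one mentioned above.

```python
import ipaddress

# Ranges to test against. Only 34.64.0.0/10 is taken from the comment
# above; a real block list would be much longer.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("34.64.0.0/10"),
]


def is_blocked(addr):
    """Return True if the address string falls inside any listed network."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED_NETWORKS)
```

A /10 covers 34.64.0.0 through 34.127.255.255, so `is_blocked("34.100.1.1")` is true while `is_blocked("34.128.0.1")` is not.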
Our Constitution was made only for a moral and religious people. It is wholly inadequate to the government of any other.