SoylentNews is people
Meta
posted by janrinok on Thursday July 24, @10:15AM   Printer-friendly

Over recent weeks we have been experiencing connections from a large number of bots, spiders, and scrapers. Some are the expected ones (Microsoft, Google, Amazon, etc.); these tend to rate-limit their requests and cause us little trouble.

Others appear to be AI-driven scrapers, and they can tie up a large percentage of the site's resources. For the most part they ignore robots.txt, and they keep going even when we return HTTP 429 (Too Many Requests). While each is individually only an annoyance, collectively their activity can affect the speed at which the site responds to members' attempts to view a page or leave a comment, and they have contributed to some of the 404 and 503 (Backend Fetch Failed) errors that you might have experienced recently.

Software has been developed to block such abusive clients for a short period. In the majority of cases this will be invisible to you as users, other than, we hope, improving the responsiveness of the site.
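The kind of short temporary blocking described above can be sketched as a per-IP sliding-window rate limiter. The thresholds, names, and structure below are illustrative assumptions, not the site's actual code:

```python
import time
from collections import defaultdict, deque

# Hypothetical sketch of per-IP rate limiting with short temporary blocks.
# The thresholds are placeholders; a real deployment would tune them.
WINDOW = 60        # seconds of history to consider
MAX_HITS = 120     # requests allowed per window before blocking
BLOCK_FOR = 300    # length of a temporary block, in seconds

hits = defaultdict(deque)   # ip -> timestamps of recent requests
blocked_until = {}          # ip -> time at which the block expires

def allow(ip, now=None):
    """Return True if the request may proceed, False if the IP is blocked."""
    now = time.time() if now is None else now
    if blocked_until.get(ip, 0) > now:
        return False                      # still inside a temporary block
    q = hits[ip]
    q.append(now)
    while q and q[0] < now - WINDOW:      # drop timestamps outside the window
        q.popleft()
    if len(q) > MAX_HITS:                 # too many requests: block briefly
        blocked_until[ip] = now + BLOCK_FOR
        return False
    return True
```

Because blocks expire on their own, a false positive clears itself after a few minutes, which matches the "short temporary blocks" behaviour described in the article.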

However, it is possible that there might occasionally be a false positive and you may encounter difficulties connecting to the site. If you do experience connection problems, please inform us immediately, either by email or on IRC. Neither of those applies filters to connections; the short temporary blocks apply only to the site itself. We will have to contact you by email to ascertain your IP address so that we can lift any block that was incorrectly applied. Please do not publish an IP address in a comment or on IRC.

If you are using a VPN or Tor, it might be advisable to try another route to circumvent any temporary block that might be affecting your connection.

 
This discussion was created by janrinok (52) for logged-in users only, but has now been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 5, Informative) by zocalo on Thursday July 24, @04:47PM (1 child)

    by zocalo (302) on Thursday July 24, @04:47PM (#1411311)
    Yeah, it's a choice, and not all options work for everyone. If you're cost/resource contrained, then the best options are definitely blocking or tarpitting (the latter is quite low resource if done right as it's just a *lot* of very, *very*, slow connections). A Raspberry Pi has more than enough grunt for a decent tarpit, but a low-end VM or older bare-metal server will do just fine too.

    For poisoning, we've used a few approaches, but the most effective ones involve re-directing the bad actor to a different server/VM to offload the traffic from the actual production servers, either co-hosted on-prem/cloud with the actual servers, or to servers we host - Garbage as a Service / GaaS; you can probably work out some of the marketing. :) Once there, it's not a particularly resource-heavy workload, but you're basically giving them all the crap they can scrape, at whatever bandwidth you can manage, until (or if!) the bot works out something is up - so it's not a great option if you're paying by the TB for outgoing traffic, unless you really don't care about the cost.

    CPU resources depend on what you're generating and in what volumes, but it's really just a slightly more sophisticated version of "Lorem Ipsum": random paragraphs of text drawn from out-of-copyright works on Project Gutenberg or similar, into which you insert images from any free-to-use image/clipart collection(s) you can find, plus some links to the script that generates more garbage. The bots don't grok nonsensical out-of-context content; they're just scraping it, not trying to parse it.
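A toy version of such a garbage generator might look like this. The word list and URL scheme are hypothetical stand-ins; a real setup would draw sentences from Project Gutenberg texts, as the poster describes:

```python
import random

# Tiny placeholder vocabulary; a real GaaS generator would pull paragraphs
# from out-of-copyright books instead.
WORDS = ("lorem ipsum dolor sit amet consectetur adipiscing elit "
         "sed do eiusmod tempor incididunt ut labore et dolore").split()

def garbage_page(n_paragraphs=3, links=5, seed=None):
    """Build an HTML page of nonsense text plus links to more nonsense."""
    rng = random.Random(seed)
    parts = []
    for _ in range(n_paragraphs):
        words = rng.choices(WORDS, k=rng.randint(40, 80))
        parts.append("<p>" + " ".join(words) + ".</p>")
    # Every page links to more dynamically generated garbage, so a scraper
    # that follows links never runs out of pages to fetch.
    for _ in range(links):
        parts.append('<a href="/junk/%08x">more</a>' % rng.getrandbits(32))
    return "\n".join(parts)
```

Generating the page is cheap; as the comment notes, the real cost is the outbound bandwidth you spend feeding it to the bot.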
    --
    UNIX? They're not even circumcised! Savages!
  • (Score: 3, Informative) by fliptop on Thursday July 24, @09:57PM

    by fliptop (1666) on Thursday July 24, @09:57PM (#1411353) Journal

    the most effective ones involve re-directing the bad actor to a different server/VM to offload the traffic from the actual production servers

    I have RewriteRule entries in httpd.conf similar to this:

    RewriteRule ^/(.*wp-admin.*)$ https://wordpress.com/$1 [L,R]

    which takes care of a lot of the bots looking for wordpress vulnerabilities. For other vulnerability scans, like POODLE, I just redirect to the appropriate CVE page, router exploits go to the router manufacturer's page, scans for shell access or /etc/passwd go to /dev/null. There's more but you get the picture.

    The most offensive cloud providers, in my experience, and in this order, are Google, Microsoft, AWS, Akamai and Oracle. There are a few smaller providers like Hurricane Electric, FranTech Solutions and PSInet that are persistent and bothersome too. I see others in there on occasion but Google is definitely out of control, especially the stuff they host in the 34.64.0.0/10 CIDR.

    Over the years I've added hundreds of thousands of IP addresses to my firewall, in some cases whole countries are blocked (yes Belarus, you're in there).
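CIDR-based blocking of the kind described can be sketched with Python's ipaddress module. The 34.64.0.0/10 range is the one called out above; the other entry is a placeholder, not part of the poster's actual blocklist:

```python
import ipaddress

# Illustrative blocklist: 34.64.0.0/10 is the Google-hosted range mentioned
# in the comment; 10.0.0.0/8 is just a placeholder second entry.
BLOCKED_NETS = [ipaddress.ip_network(cidr) for cidr in (
    "34.64.0.0/10",
    "10.0.0.0/8",
)]

def is_blocked(ip):
    """Return True if the address falls inside any blocked CIDR range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETS)
```

In practice a firewall (ipset/nftables) does this matching in the kernel far more efficiently than any application-level check; this sketch just shows the containment logic.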

    --
    Our Constitution was made only for a moral and religious people. It is wholly inadequate to the government of any other.