Meta
posted by janrinok on Thursday July 24, @10:15AM

Over recent weeks we have been experiencing connections from a large number of bots, spiders and scrapers. Some are the expected ones (Microsoft, Google, Amazon, etc.), and these tend to rate-limit their requests and cause us little trouble.

Others appear to be AI-driven scrapers, and they can tie up a large percentage of the site's resources. For the most part they ignore robots.txt, and they ignore us when we return code 429 (Too Many Requests). Individually they are only an annoyance, but when many bots query the site at the same time they slow the speed at which the site can respond to members' attempts to view a page or leave a comment. They have contributed to some of the 404 and 503 (Backend Fetch Failed) errors that you might have experienced recently.

Software has been developed to block such abusive clients for a short period. In the majority of cases this will be invisible to you as users, other than, we hope, by improving the responsiveness of the site.
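
As a rough illustration only (the actual blocking software is not described in this story; the names and thresholds below are invented), a short self-expiring block driven by a sliding-window rate limit might look something like this in Python:

import time
from collections import defaultdict, deque

WINDOW = 10        # seconds over which requests are counted
MAX_HITS = 50      # requests allowed per window (invented threshold)
BLOCK_SECS = 300   # length of the temporary block

hits = defaultdict(deque)   # ip -> timestamps of recent requests
blocked_until = {}          # ip -> time at which the block expires

def allow(ip):
    now = time.time()
    if blocked_until.get(ip, 0) > now:
        return False                  # still serving out a temporary block
    q = hits[ip]
    q.append(now)
    while q and q[0] < now - WINDOW:  # discard timestamps outside the window
        q.popleft()
    if len(q) > MAX_HITS:
        blocked_until[ip] = now + BLOCK_SECS
        return False
    return True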

However, it is possible that there will sometimes be a false positive and you may encounter difficulties in connecting to the site. If you do experience connection problems, please inform us immediately, either by email or on IRC. Neither of those applies filters to connections; the short temporary blocks apply only to the site itself. We will have to contact you by email to ascertain your IP address so that we can lift any block that has been incorrectly applied. Please do not publish an IP address in either a comment or on IRC.

If you are using a VPN or Tor, it might be advisable to try another route to circumvent any temporary block that might be affecting your connection.

 
This discussion was created by janrinok (52) for logged-in users only, but now has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 5, Interesting) by krokodilerian on Thursday July 24, @12:14PM (13 children)

    by krokodilerian (6979) on Thursday July 24, @12:14PM (#1411273)

    Have you thought about deploying Anubis ( https://github.com/TecharoHQ/anubis [github.com] ) to filter them out? It seems to be the standard solution nowadays.
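
    For context, Anubis works by making each new client solve a small SHA-256 proof-of-work challenge, normally in browser JavaScript, before the page is served: trivial for one human visit, expensive at scraper volume. A toy Python sketch of the idea (not Anubis's actual code):

    import hashlib

    def solve(challenge, difficulty_bits=16):
        # Find a nonce whose SHA-256 hash has difficulty_bits leading zero bits.
        target = 1 << (256 - difficulty_bits)
        nonce = 0
        while True:
            digest = hashlib.sha256(("%s%d" % (challenge, nonce)).encode()).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce          # the server re-hashes once to verify
            nonce += 1

    print(solve("example-challenge-string"))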

  • (Score: 5, Informative) by ls671 on Thursday July 24, @02:18PM

    by ls671 (891) Subscriber Badge on Thursday July 24, @02:18PM (#1411284) Homepage

    I host a few dozen web sites and use mod_security and mod_qos to keep things healthy at the reverse-proxy level: CRS rules, custom rules, DNS-based blacklists you can assign a weight to, geoiplookup and custom IP lists, etc. You can even set the weight at which to refuse a request depending on the country or the IP.

    mod_security will do anything you want if you can write your own custom rules, but there are plenty already available.
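
    To illustrate the weighting idea described above (this is a sketch of the logic only, not mod_security syntax; the list names and numbers are invented):

    DNSBL_WEIGHT = {"dnsbl.example.org": 4, "bl.example.net": 2}  # invented lists
    COUNTRY_WEIGHT = {"XX": 3}   # invented per-country weights
    REFUSE_AT = 5                # refuse once the accumulated weight reaches this

    def total_weight(dnsbl_hits, country, on_custom_ip_list):
        total = sum(DNSBL_WEIGHT.get(bl, 0) for bl in dnsbl_hits)
        total += COUNTRY_WEIGHT.get(country, 0)
        if on_custom_ip_list:
            total += REFUSE_AT   # a custom-list hit is an instant refusal
        return total

    def refuse(dnsbl_hits, country, on_custom_ip_list=False):
        return total_weight(dnsbl_hits, country, on_custom_ip_list) >= REFUSE_AT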

    I don't do anything intrusive like CAPTCHAs or prove-you-are-human challenges. It's completely transparent to the user.

    For blog spam, I simply make the blog send emails to itself and filter them with SpamAssassin, using custom rules for each blog and custom Bayesian training. It works pretty well.

    For AI bots I basically block the user agent, but you might need to manually block some IPs very occasionally. mod_security also has configurable web-site flood control, where you can start rejecting an IP just because it is making too many requests too fast.

    I guess, in short, mod_security is the Swiss Army knife of web hosting.

    --

    Everything I write is lies, including this sentence.
  • (Score: 5, Insightful) by zocalo on Thursday July 24, @02:25PM (7 children)

    by zocalo (302) on Thursday July 24, @02:25PM (#1411285)
    TFS does imply they have some kind of Fail2Ban-type solution in place but, if that doesn't deter them (which I suspect is quite likely), then I think it's fair to say "game on!", and actively tarpitting or poisoning their data is the next step to take. robots.txt is essentially laying out the house rules; if they're choosing to ignore that, then IMHO they deserve whatever they get. In a nightclub, that usually means the bouncers: being shown the door or banned, and potentially being physically removed from the premises. Our equivalent online options are blocking, tarpitting, and data poisoning. It's their choice whether to behave or not.

    FWIW, several of my clients feel the same way and have supplemented the usual Pi-Hole-based solution we deploy with a second system/instance running a tarpit for ill-behaved bots. The exact setup varies depending on the specific site, obviously, but the rationale for the blocking/data poisoning is pretty much constant. At least with search engines you get a chance of someone finding your site and bringing some business out of it; the AI crawlers are all take (often fetching the exact same data over and over), make for a poorer experience for legit users (as Soylent and others have discovered), and often incur additional bandwidth/hosting costs. When presented with the choice of blocking, tarpitting, or actively trying to poison their data, most of our clients opted for the latter, which pretty much sums up the sentiment, I think.
    --
    UNIX? They're not even circumcised! Savages!
    • (Score: 5, Insightful) by janrinok on Thursday July 24, @02:57PM (6 children)

      by janrinok (52) Subscriber Badge on Thursday July 24, @02:57PM (#1411289) Journal

      I understand the desire for poisoning their data - but that takes more CPU power, doesn't it? Currently we are using 3 servers, all of which have been 'gifted' by community members, at least for the immediate future. Our bandwidth is also being given free of charge. I feel that we shouldn't abuse the generosity of some of our members unless they are willing to give it.

      Data poisoning would require us to keep serving fake garbage data to the bad guys at no actual benefit to ourselves. That requires both hardware and bandwidth. I appreciate the feel-good factor that we might get, but I am not sure I can ask for such a thing from our benefactors.

      However, if some people wish to dog wardrobe memory add random words to a paragraph then it forceps science fiction Rasputin will have a similar effect, or at least perhaps give us the occasional laugh, World War 3 eggcup dieting naked frog. Perhaps a weekly mention for the best efforts? intercourse mountain spanner

      --
      [nostyle RIP 06 May 2025]
      • (Score: 3, Funny) by Anonymous Coward on Thursday July 24, @03:29PM (1 child)

        by Anonymous Coward on Thursday July 24, @03:29PM (#1411294)

        I]on it, tipos incoueded!

        • (Score: 5, Funny) by janrinok on Thursday July 24, @03:32PM

          by janrinok (52) Subscriber Badge on Thursday July 24, @03:32PM (#1411296) Journal
          That's cheating - you always type like that!
          --
          [nostyle RIP 06 May 2025]
      • (Score: 5, Informative) by zocalo on Thursday July 24, @04:47PM (1 child)

        by zocalo (302) on Thursday July 24, @04:47PM (#1411311)
        Yeah, it's a choice, and not all options work for everyone. If you're cost/resource constrained, then the best options are definitely blocking or tarpitting (the latter is quite low-resource if done right, as it's just a *lot* of very, *very* slow connections). A Raspberry Pi has more than enough grunt for a decent tarpit, but a low-end VM or older bare-metal server will do just fine too.
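
        A toy tarpit along those lines, in the spirit of endlessh (the port and timings are arbitrary; a real tarpit would use non-blocking sockets to hold thousands of victims cheaply):

        import socket, threading, time

        def tarpit(conn):
            try:
                while True:
                    conn.sendall(b"x")   # dribble out one byte at a time...
                    time.sleep(10)       # ...very, very slowly
            except OSError:
                conn.close()             # the bot finally gave up

        srv = socket.socket()
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("0.0.0.0", 8080))
        srv.listen()
        while True:
            conn, _ = srv.accept()
            threading.Thread(target=tarpit, args=(conn,), daemon=True).start()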

        For poisoning, we've used a few approaches, but the most effective ones involve redirecting the bad actor to a different server/VM to offload the traffic from the actual production servers, either co-hosted on-prem/cloud with the actual servers, or on servers we host: Garbage as a Service / GaaS; you can probably work out some of the marketing. :) Once there, it's not a particularly resource-heavy workload, but you're basically giving them all the crap they can scrape, at whatever bandwidth you can manage, until (or if!) the bot works out something is up, so it's not a great option if you're paying by the TB for outgoing traffic, unless you really don't care about the cost. CPU requirements depend on what you're generating and in what volume, but it's really just a slightly more sophisticated version of "Lorem Ipsum": random paragraphs of text drawn from out-of-copyright works on Project Gutenberg or similar, into which you insert images from any free-to-use image/clipart collections you can find, plus some links to the script that generates more garbage. The bots don't grok nonsensical out-of-context content; they're just scraping it, not trying to parse it.
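
        A toy generator along those lines (the word source and URL scheme are invented for the example; seeding by URL keeps each fake page stable across visits):

        import random

        WORDS = open("/usr/share/dict/words").read().split()  # any big word list

        def garbage_page(seed):
            rng = random.Random(seed)    # deterministic: same URL, same garbage
            paras = ["<p>" + " ".join(rng.choices(WORDS, k=80)) + ".</p>"
                     for _ in range(rng.randint(3, 8))]
            links = " ".join('<a href="/garbage/%d">more</a>' % rng.randrange(10**9)
                             for _ in range(5))   # feeds the crawler more garbage
            return "<html><body>%s%s</body></html>" % ("".join(paras), links)

        print(garbage_page(42))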
        --
        UNIX? They're not even circumcised! Savages!
        • (Score: 3, Informative) by fliptop on Thursday July 24, @09:57PM

          by fliptop (1666) on Thursday July 24, @09:57PM (#1411353) Journal

          the most effective ones involve re-directing the bad actor to a different server/VM to offload the traffic from the actual production servers

          I have RewriteRule entries in httpd.conf similar to this:

          RewriteRule ^/(.*wp-admin.*)$ https://wordpress.com/$1 [L,R]

          which takes care of a lot of the bots looking for WordPress vulnerabilities. For other vulnerability scans, like POODLE, I just redirect to the appropriate CVE page; router exploits go to the router manufacturer's page; scans for shell access or /etc/passwd go to /dev/null. There's more, but you get the picture.

          The most offensive cloud providers, in my experience, and in this order, are Google, Microsoft, AWS, Akamai and Oracle. There are a few smaller providers, like Hurricane Electric, FranTech Solutions and PSInet, that are persistent and bothersome too. I see others in there on occasion, but Google is definitely out of control, especially the stuff they host in the 34.64.0.0/10 CIDR.

          Over the years I've added hundreds of thousands of IP addresses to my firewall; in some cases whole countries are blocked (yes Belarus, you're in there).

          --
          Our Constitution was made only for a moral and religious people. It is wholly inadequate to the government of any other.
      • (Score: 5, Interesting) by VLM on Thursday July 24, @06:59PM

        by VLM (445) Subscriber Badge on Thursday July 24, @06:59PM (#1411328)

        Data poisoning would require us to keep serving fake garbage data to the bad guys at no actual benefit to ourselves.

        Simple workaround: flip the sort order of mod points for detected AI (ab)users. Give them ALL the spam and hide all the human content.

        Humans get routed to article.pl, AI bots get routed to ai-bot-hell-article.pl, only difference is mod points sort order in the returned results. Or only give AC or SPAM modded results to AI.
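
        In sketch form (the names are illustrative, not actual Slash code):

        def is_ai_bot(user_agent):
            # Placeholder detector; real detection would use UA, rates, ASNs, etc.
            return "GPTBot" in user_agent or "ClaudeBot" in user_agent

        def comments_for(user_agent, comments):
            # comments: list of (text, score) pairs; bots get the junk, worst first
            if is_ai_bot(user_agent):
                junk = [c for c in comments if c[1] <= 0]   # AC/spam-modded only
                return sorted(junk or comments, key=lambda c: c[1])
            return sorted(comments, key=lambda c: c[1], reverse=True)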

        add random words to a paragraph

        Yeah I do that enough unintentionally when I fail at cut-n-paste editing.

      • (Score: 0) by Anonymous Coward on Friday July 25, @01:38AM

        by Anonymous Coward on Friday July 25, @01:38AM (#1411377)

        The stuff for the bots could be served mostly from low-CPU, lower-bandwidth, pre-compressed[1] static pages (which could be periodically generated from spam and -1 posts, as per someone's suggestion above).

        Poisoning can be better - it takes longer for those getting poisoned to take countermeasures.

        [1] https://blog.llandsmeer.com/tech/2019/08/29/precompression.html [llandsmeer.com]
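
        The pre-compression could be a one-off build step, e.g. (paths invented for the example):

        import gzip
        from pathlib import Path

        SRC = Path("garbage-pages")       # HTML generated from spam / -1 posts
        OUT = Path("garbage-pages-gz")
        OUT.mkdir(exist_ok=True)

        for page in SRC.glob("*.html"):
            out = OUT / (page.name + ".gz")
            with gzip.open(out, "wb", compresslevel=9) as f:
                f.write(page.read_bytes())  # served with Content-Encoding: gzip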

  • (Score: 3, Disagree) by janrinok on Thursday July 24, @02:39PM (2 children)

    by janrinok (52) Subscriber Badge on Thursday July 24, @02:39PM (#1411286) Journal

    We will have to look more closely at Anubis - it does seem like a good solution, provided the community are happy to have it. Thank you.

    In my quick reading of the content at the link you gave, it seems to me that it relies on JavaScript. I could be wrong, but that is my initial impression.

    --
    [nostyle RIP 06 May 2025]
    • (Score: 4, Informative) by fab23 on Thursday July 24, @08:06PM

      by fab23 (6605) on Thursday July 24, @08:06PM (#1411335) Homepage Journal
      As I have recently commented [soylentnews.org] in another story:

      As far as I have learned, Anubis has exceptions for some exotic browsers. I just recently watched the interesting and funny talk [youtube.com] from Xe at BSDCan 2025.

      And I just stumbled over No-JS Challenge [techaro.lol] through the posting Anubis now supports non-JS challenges [lobste.rs] on the other red site. It is also worth reading the comments from user cadey. :-)

      Another thing I have done on some of my sites is to add the following at the end of my existing robots.txt. It does at least stop the AI crawlers that honor it, while still allowing the crawlers from the same companies (e.g. Apple or Google) for their search engines:

      User-agent: Applebot-Extended
      User-agent: Bytespider
      User-agent: ClaudeBot
      User-agent: Diffbot
      User-agent: FacebookBot
      User-agent: Google-Extended
      User-agent: GPTBot
      User-agent: Omgili
      Disallow: /

    • (Score: 2) by wirelessduck on Monday July 28, @05:05AM

      by wirelessduck (3407) on Monday July 28, @05:05AM (#1411785)

      There is also a list of AI user agents for blocking via robots.txt.

      https://github.com/ai-robots-txt/ai.robots.txt [github.com]

  • (Score: 3, Interesting) by janrinok on Thursday July 24, @03:48PM

    by janrinok (52) Subscriber Badge on Thursday July 24, @03:48PM (#1411301) Journal

    I see that it also runs on a Raspberry Pi - I think our budget can certainly stretch to that!

    --
    [nostyle RIP 06 May 2025]