
Meta
posted by janrinok on Thursday July 24, @10:15AM

Over recent weeks we have been experiencing connections from a large number of bots, spiders and scrapers. Some are the expected ones (Microsoft, Google, Amazon, etc.); these tend to rate-limit their requests and cause us few problems.

Others appear to be AI-driven scrapers, and they can tie up a large percentage of the site's resources. For the most part they ignore robots.txt, as well as the 429 (Too Many Requests) codes we return. While each is individually only an annoyance, when many bots query the site at the same time their combined activity can slow the speed at which the site responds to members' attempts to view a page or leave a comment. They have contributed to some of the 404 and 503 (Backend Fetch Failed) errors that you might have experienced recently.
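For context, robots.txt is purely advisory: a well-behaved crawler fetches it and honours the rules, while the scrapers described above simply skip that step. A minimal entry refusing a crawler site-wide looks like this (the user-agent tokens shown are illustrative examples of published AI-crawler names, not a statement of what SoylentNews actually serves):

```
# robots.txt — refuse specific crawlers everywhere on the site
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else may crawl normally
User-agent: *
Disallow:
```

Since compliance is voluntary, a file like this only filters out the polite crawlers; anything else has to be handled server-side, as the article goes on to describe.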

Software has been developed to block such abusive clients for a short period. In the majority of cases this will be invisible to you as a user, other than, we hope, an improvement in the responsiveness of the site.

However, it is possible that there will occasionally be a false positive and you may encounter difficulties connecting to the site. If you do experience connection problems, please inform us immediately, either by email or on IRC. Neither of those applies filters to connections; the short temporary blocks apply only to the site itself. We will then contact you by email to ascertain your IP address so that we can lift any block that has been incorrectly applied. Please do not publish an IP address in a comment or on IRC.

If you are using a VPN or Tor, it might be advisable to try another route to circumvent any temporary block that might be affecting your connection.

 
This discussion was created by janrinok (52) for logged-in users only, but now has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 5, Informative) by ls671 (891) Subscriber Badge on Thursday July 24, @02:18PM (#1411284)

    I host a few dozen web sites and use mod_security and mod_qos to keep things healthy at the reverse-proxy level: CRS rules, custom rules, DNS-based blacklists you can assign a weight to, geoiplookup, custom IP lists, etc. You can even set the weight at which to refuse a request depending on the country or the IP.
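    A minimal sketch of what that looks like in ModSecurity rule syntax, assuming a GeoIP database is installed (the database path, country code and list file below are illustrative, and directive details vary between ModSecurity versions):

    ```apache
    # Point ModSecurity at a GeoIP database (path/format depends on version)
    SecGeoLookupDb /usr/share/GeoIP/GeoIP.dat

    # Resolve the client address into the GEO collection
    SecRule REMOTE_ADDR "@geoLookup" "phase:1,id:1001,pass,nolog"

    # Refuse requests from a specific country code (XX is a placeholder)
    SecRule GEO:COUNTRY_CODE "@streq XX" \
        "phase:1,id:1002,deny,status:403,msg:'Blocked by country policy'"

    # Refuse addresses from a locally maintained IP list
    SecRule REMOTE_ADDR "@ipMatchFromFile /etc/modsecurity/blocklist.txt" \
        "phase:1,id:1003,deny,status:403,msg:'Blocked by local IP list'"
    ```

    In practice the commenter's weighted approach would accumulate a score across several such signals (via setvar) rather than denying on any single match, which is also how the OWASP CRS anomaly-scoring mode works.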

    mod_security will do almost anything you want if you can write your own custom rules, but there are plenty already available.

    I don't do anything intrusive like captchas or prove-you-are-human challenges. It's completely transparent to the user.

    For blog spam, I simply make the blog send emails to itself and filter them with SpamAssassin, using custom rules for each blog and custom Bayesian training; it works pretty well.
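    A per-blog SpamAssassin custom rule of the kind described is just a body pattern plus a score in a local configuration file; a minimal sketch (rule name, pattern and score are illustrative):

    ```
    # /etc/mail/spamassassin/local.cf -- illustrative custom rule
    body     MYBLOG_SPAM_WORDS   /\b(casino|payday loan)\b/i
    score    MYBLOG_SPAM_WORDS   3.0
    describe MYBLOG_SPAM_WORDS   Keywords common in this blog's comment spam

    # Bayesian classification is trained separately, e.g.:
    #   sa-learn --spam  /path/to/spam-mbox
    #   sa-learn --ham   /path/to/ham-mbox
    ```

    Routing comments through mail like this is a neat trick because it reuses SpamAssassin's whole scoring pipeline (network tests, Bayes, custom rules) without writing any blog-side filtering code.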

    For AI scrapers I basically block the user agent, though you might need to manually block some IPs very occasionally. mod_security also has configurable web-site flood control, where you can start rejecting an IP simply because it is making too many requests too fast.
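    Both techniques can be sketched in ModSecurity rules. The user-agent list and the thresholds below are illustrative, and the persistent IP collection requires SecDataDir to be configured:

    ```apache
    # Deny requests whose User-Agent matches a phrase list of known scrapers
    SecRule REQUEST_HEADERS:User-Agent "@pmFromFile /etc/modsecurity/bad-bots.txt" \
        "phase:1,id:2001,deny,status:403,msg:'Scraper user agent'"

    # Simple flood control: track request counts per client IP
    SecAction "phase:1,id:2002,initcol:ip=%{REMOTE_ADDR},pass,nolog"

    # Reject once a client exceeds the threshold in the current window
    SecRule IP:REQUESTS "@gt 100" \
        "phase:1,id:2003,deny,status:429,msg:'Request rate too high'"

    # Count this request; decay the counter by 100 per 60 seconds
    SecAction "phase:1,id:2004,setvar:ip.requests=+1,deprecatevar:ip.requests=100/60,pass,nolog"
    ```

    Returning 429 rather than 403 for the rate-limit case matches the behaviour the article describes, though as noted there, abusive scrapers tend to ignore it anyway.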

    I guess, in short, mod_security is the Swiss Army knife of web hosting.

    --

    Everything I write is lies, including this sentence.
    Starting Score:    1  point
    Moderation   +4  
       Interesting=1, Informative=3, Total=4
    Extra 'Informative' Modifier   0  
    Karma-Bonus Modifier   +1  

    Total Score:   5