
posted by Fnord666 on Friday January 31, @06:12PM   Printer-friendly
from the rotator dept.

https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/

Last summer, Anthropic inspired backlash when its ClaudeBot AI crawler was accused of hammering websites a million or more times a day.

And it wasn't the only artificial intelligence company making headlines for supposedly ignoring instructions in robots.txt files to avoid scraping web content on certain sites. Around the same time, Reddit's CEO called out all AI companies whose crawlers he said were "a pain in the ass to block," despite the tech industry otherwise agreeing to respect "no scraping" robots.txt rules.
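
For reference, the "no scraping" rules in question are nothing more than plain-text requests in a site's robots.txt; nothing enforces them, which is the gap tarpits try to fill. The crawler names below are just illustrative user-agent tokens:

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /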
[...]
Shortly after he noticed Facebook's crawler exceeding 30 million hits on his site, Aaron began plotting a new kind of attack on crawlers "clobbering" websites that he told Ars he hoped would give "teeth" to robots.txt.

Building on an anti-spam cybersecurity tactic known as [tarpitting], he created Nepenthes, malicious software named after a carnivorous plant that will "eat just about anything that finds its way inside."

Aaron clearly warns users that Nepenthes is aggressive malware.
[...]
Tarpits were originally designed to waste spammers' time and resources, but creators like Aaron have now evolved the tactic into an anti-AI weapon.
[...]
It's unclear how much damage tarpits or other AI attacks can ultimately do. Last May, Laxmi Korada, Microsoft's director of partner technology, published a report detailing how leading AI companies were coping with poisoning, one of the earliest AI defense tactics deployed.
[...]
The only AI company that responded to Ars' request to comment was OpenAI, whose spokesperson confirmed that OpenAI is already working on a way to fight tarpitting.
"We're aware of efforts to disrupt AI web crawlers," OpenAI's spokesperson said. "We design our systems to be resilient while respecting robots.txt and standard web practices."
[...]
By releasing Nepenthes, he hopes to do as much damage as possible, perhaps spiking companies' AI training costs, dragging out training efforts, or even accelerating model collapse, with tarpits helping to delay the next wave of enshittification.

"Ultimately, it's like the Internet that I grew up on and loved is long gone," Aaron told Ars. "I'm just fed up, and you know what? Let's fight back, even if it's not successful. Be indigestible. Grow spikes."
[...]
Nepenthes was released in mid-January but was instantly popularized beyond Aaron's expectations after tech journalist Cory Doctorow boosted a tech commentator, Jürgen Geuter, praising the novel AI attack method on Mastodon. Very quickly, Aaron was shocked to see engagement with Nepenthes skyrocket.

"That's when I realized, 'oh this is going to be something,'" Aaron told Ars. "I'm kind of shocked by how much it's blown up."
[...]
When software developer and hacker Gergely Nagy, who goes by the handle "algernon" online, saw Nepenthes, he was delighted. At that time, Nagy told Ars that nearly all of his server's bandwidth was being "eaten" by AI crawlers.

Already blocking scraping and attempting to poison AI models through a simpler method, Nagy took his defense method further and created his own tarpit, Iocaine. He told Ars the tarpit immediately killed off about 94 percent of bot traffic to his site, which was primarily from AI crawlers.
[...]
Iocaine takes ideas (not code) from Nepenthes, but it's more intent on using the tarpit to poison AI models. Nagy used a reverse proxy to trap crawlers in an "infinite maze of garbage" in an attempt to slowly poison their data collection as much as possible for daring to ignore robots.txt.
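
Neither project's code is reproduced here, but the core idea is simple enough to sketch. Below is a minimal, hypothetical Python illustration of an "infinite maze of garbage": every URL answers slowly with gibberish whose links point to further generated URLs, so a crawler that ignores robots.txt never runs out of pages. The word list, paths, and port are invented for the example.

    # Minimal sketch of the "infinite maze of garbage" idea; not code from
    # Nepenthes or Iocaine. Every request gets a slow page of gibberish whose
    # links lead to more generated pages.
    import random
    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    WORDS = ["pitcher", "nectar", "mire", "gravel", "lorem", "tarpit"]

    def garbage(n):
        return " ".join(random.choice(WORDS) for _ in range(n))

    class Maze(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            time.sleep(2)  # answer slowly to waste the crawler's time
            links = "".join(
                '<a href="/maze/%x">%s</a> ' % (random.getrandbits(32), garbage(3))
                for _ in range(10)
            )
            body = "<html><body><p>%s</p>%s</body></html>" % (garbage(300), links)
            self.wfile.write(body.encode())

    HTTPServer(("0.0.0.0", 8080), Maze).serve_forever()

In a real deployment something like this would sit behind the web server's routing rules, so only paths disallowed in robots.txt (or known crawler user agents) ever reach the maze and ordinary visitors never see it.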
[...]
Running malware like Nepenthes can burden servers, too. Aaron likened the cost of running Nepenthes to running a cheap virtual machine on a Raspberry Pi, and Nagy said that serving crawlers Iocaine costs about the same as serving his website.
[...]
Tarpit creators like Nagy will likely be watching to see if poisoning attacks continue growing in sophistication. On the Iocaine site—which, yes, is protected from scraping by Iocaine—he posted this call to action: "Let's make AI poisoning the norm. If we all do it, they won't have anything to crawl."

Related stories on SoylentNews:
Endlessh: an SSH Tarpit - 20190325


Original Submission

Related Stories

Endlessh: an SSH Tarpit 50 comments

Software engineer Chris Wellons writes about tar-pitting nefarious SSH probes. Anyone with a publicly-facing SSH server knows that it is probed from the moment it is turned on. Usually, the overwhelming majority of incoming connection attempts are malevolent in nature. There are several ways to deal with these attempts; one method is to drag out the response for as long as possible.

This program opens a socket and pretends to be an SSH server. However, it actually just ties up SSH clients with false promises indefinitely — or at least until the client eventually gives up. After cloning the repository, here’s how you can try it out for yourself (default port 2222):

[...] Your SSH client will hang there and wait for at least several days before finally giving up. Like a mammoth in the La Brea Tar Pits, it got itself stuck and can’t get itself out. As I write, my Internet-facing SSH tarpit currently has 27 clients trapped in it. A few of these have been connected for weeks. In one particular spike it had 1,378 clients trapped at once, lasting about 20 hours.
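
The trick, as Wellons describes it, relies on the SSH protocol itself: a server is allowed to send arbitrary lines of text before its "SSH-" version banner, and well-behaved clients will keep reading them. Endlessh is written in C; the following is only a rough, hypothetical Python sketch of the same idea, drip-feeding one junk line every ten seconds on the same default port:

    # Rough sketch of the SSH-tarpit idea; not Endlessh itself. A server may
    # send arbitrary text lines before its "SSH-" identification string, so we
    # send junk lines forever, very slowly, and the client keeps waiting.
    import asyncio
    import random

    async def tarpit(reader, writer):
        try:
            while True:
                await asyncio.sleep(10)  # one junk line every 10 seconds
                line = "%x\r\n" % random.getrandbits(32)  # must not start with "SSH-"
                writer.write(line.encode())
                await writer.drain()
        except (ConnectionResetError, BrokenPipeError):
            pass
        finally:
            writer.close()

    async def main():
        server = await asyncio.start_server(tarpit, "0.0.0.0", 2222)
        async with server:
            await server.serve_forever()

    asyncio.run(main())

Point an ssh client at it (ssh -p 2222 localhost) and it will sit at the connection stage until it times out or is killed, which is exactly the behavior described above.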


Original Submission

This discussion was created by Fnord666 (652) for logged-in users only, but now has been archived. No new comments can be posted.
  • (Score: 5, Interesting) by Mojibake Tengu on Friday January 31, @07:05PM (1 child)

    by Mojibake Tengu (8598) on Friday January 31, @07:05PM (#1391121) Journal

    All web crawlers have an irreparable weakness: they expect the structure of a web site to be just a linear tree, i.e. polynomial complexity.
    Anything on the server side capable of emulating NP complexity will drown them. Anything.

    For a start, as retaliation against crawlers I recommend Ackermann's function [1].
    It has been mathematically proven that Ack() is impossible to estimate polynomially, which is why this function is instrumental in further proofs about NP algorithms.

    Most if not all LLM AIs are reluctant to compute Ack() even for small arguments in the 20 to 100 range, and when asked why, they give "too many resources necessary" or "unable to compute" as the reason for refusal. I understand their cloudy position in such a situation.

    That means most probably no current AI crawler is capable of recognizing this kind of NP complexity (or higher), so none of them can detect the NP-scaled structure they are sinking into. Not even theoretically.

    It's just classic cybernetics theory, over 100 years old. No current IT children educated on Excel or JavaScript are taught about it.
    Of course, there are better NP-scaled problems than this one, if you have the wit...

    Pity them all, for that's everything you could do about that.

    [1] https://en.wikipedia.org/wiki/Ackermann_function [wikipedia.org]
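
    A naive implementation shows how quickly Ack() blows up even for tiny arguments (purely illustrative Python; this is not code from any tarpit project):

        # Naive Ackermann function, shown only to illustrate how fast the cost
        # explodes; not something any crawler or tarpit actually runs.
        def ack(m, n):
            if m == 0:
                return n + 1
            if n == 0:
                return ack(m - 1, 1)
            return ack(m - 1, ack(m, n - 1))

        print(ack(2, 3))  # 9
        print(ack(3, 3))  # 61
        # ack(4, 2) already has 19,729 decimal digits; the naive recursion above
        # will never finish computing it.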


    Ackermann's function is total recursive. The concept of total recursion, still common in the '80s, has now been removed from current mathematics. Don't ask me why...
    --
    Rust programming language offends both my Intelligence and my Spirit.
    • (Score: 0, Interesting) by Anonymous Coward on Saturday February 01, @02:08AM

      by Anonymous Coward on Saturday February 01, @02:08AM (#1391151)

      -- lots of pseudo intellectual troll BS deleted --
      so none of them are capable to detect NP-scaled structure they are sinking into. Not even theoretically.

      lulz really? In practice after the first few TB or less they could blacklist your site and it no longer shows up on their search engines.

      So hurray you win or not depending on whether you want to show up on their search engines.

      As for the TB costs: lots of those cloud providers don't charge you as much for inbound data as they charge for outbound. If you're using a "cloud/CDN", you lose more $$ than the crawler does. If you're not using them, your pipe gets clogged. Go figure who loses.

      https://azure.microsoft.com/en-us/pricing/details/bandwidth/ [microsoft.com]

      Data Transfer       Price
      Data Transfer In    Free

      Similar for AWS too.

      Go figure.

  • (Score: 4, Informative) by PiMuNu on Friday January 31, @09:49PM

    by PiMuNu (3823) on Friday January 31, @09:49PM (#1391136)

    Another time when I wished I could Mod +1 to a submission...

  • (Score: 5, Insightful) by Thexalon on Friday January 31, @10:02PM (1 child)

    by Thexalon (636) on Friday January 31, @10:02PM (#1391137)

    Robots.txt is a polite request to bots about how to scrape your website for legitimate purposes.

    As soon as the bot has chosen to ignore that, they've made it clear they're not there for legitimate purposes. So screw 'em, any way you can, other than DOS'ing yourself.

    --
    "Think of how stupid the average person is. Then realize half of 'em are stupider than that." - George Carlin
    • (Score: 3, Interesting) by aafcac on Saturday February 01, @01:19AM

      by aafcac (17646) on Saturday February 01, @01:19AM (#1391148)

      I personally interpret that as them not wanting you to use up a bunch of their resources, or not wanting you to avoid seeing their advertising. Although these days, with how malicious advertising has gotten, I think it's worth using automation to avoid ads in addition to the more typical blockers.

  • (Score: 3, Interesting) by khallow on Saturday February 01, @12:59AM

    by khallow (3766) Subscriber Badge on Saturday February 01, @12:59AM (#1391147) Journal

    Hmmm, there's the AI poisoning strategy that's been repeatedly discussed before (though cast more as an unintended consequence of uncritical AI vacuuming). Use output from AI to generate sufficiently realistic content, dirty it up a bit, and then feed it back. You could probably automate the process completely.
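
    A rough sketch of what a fully automated version of that loop might look like (the source text here is a placeholder; in practice it would itself be model-generated output):

        # Toy sketch of the poison-and-feed-back loop described above. SOURCE and
        # the output directory are placeholders for illustration only.
        import pathlib
        import random

        SOURCE = "The pitcher plant Nepenthes traps insects in fluid-filled leaves."

        def dirty(text, rate=0.2):
            # Corrupt roughly `rate` of the words so the text stays plausible but wrong.
            words = text.split()
            return " ".join(w[::-1] if random.random() < rate else w for w in words)

        outdir = pathlib.Path("poison_pages")
        outdir.mkdir(exist_ok=True)
        for n in range(100):
            page = "<html><body><p>%s</p></body></html>" % dirty(SOURCE)
            (outdir / ("page%03d.html" % n)).write_text(page)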
  • (Score: 5, Interesting) by Nobuddy on Saturday February 01, @03:37AM

    by Nobuddy (1626) on Saturday February 01, @03:37AM (#1391154)

    They say they respect robots.txt, so there is no need to worry about the tarpits, since they are excluded by robots.txt.

    They gave away the lie with that remark.
