SoylentNews is people

posted by hubie on Monday July 08, @06:31PM

Arthur T Knackerbracket has processed the following story:

Cloudflare has released a new free tool that prevents AI companies' bots from scraping its clients' websites for content to train large language models. The cloud service provider is making this tool available to its entire customer base, including those on free plans. "This feature will automatically be updated over time as we see new fingerprints of offending bots we identify as widely scraping the web for model training," the company said.

In a blog post announcing this update, Cloudflare's team also shared some data about how its clients are responding to the boom of bots that scrape content to train generative AI models. According to the company's internal data, 85.2 percent of customers have chosen to block even the AI bots that properly identify themselves from accessing their sites.

[...] It's proving very difficult to fully and consistently block AI bots from accessing content. The arms race to build models faster has led to instances of companies skirting or outright breaking the existing rules around blocking scrapers. Perplexity AI was recently accused of scraping websites without the required permissions. But having a backend company at the scale of Cloudflare getting serious about trying to put the kibosh on this behavior could lead to some results.

"We fear that some AI companies intent on circumventing rules to access content will persistently adapt to evade bot detection," the company said. "We will continue to keep watch and add more bot blocks to our AI Scrapers and Crawlers rule and evolve our machine learning models to help keep the Internet a place where content creators can thrive and keep full control over which models their content is used to train or run inference on."


The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 4, Funny) by EJ on Monday July 08, @07:32PM (1 child)

    by EJ (2452) on Monday July 08, @07:32PM (#1363479)

    AI can't click those buttons. It's one of the three laws.

    • (Score: 2, Touché) by anubi on Monday July 08, @10:19PM

      by anubi (2828) on Monday July 08, @10:19PM (#1363495) Journal

      Yup...AIs have their own "browser". They can do everything a human does...and read the source!

      --
      "Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]
  • (Score: 4, Interesting) by VLM on Monday July 08, @08:08PM (3 children)

    by VLM (445) on Monday July 08, @08:08PM (#1363480)

    1) I wonder how many of cloudflare's customers have content generated by AI.

    2) I know the AI peeps are not happy about the idea of AI trained on AI generated content, so I wonder if in the long run a contractual requirement for using AI in public might be blocking AI scrapers, inevitably resulting in browser filtering tools that block AI spam and AI propaganda by not showing content that blocks bots. That would be a plausible, interesting outcome.

    • (Score: 3, Interesting) by ikanreed on Monday July 08, @08:12PM

      by ikanreed (3164) Subscriber Badge on Monday July 08, @08:12PM (#1363481) Journal

      1) I wonder how many of cloudflare's customers have content generated by AI.

      Eh, some. In general, the LLM-generated content is mostly worthless, and as a result isn't particularly valuable to scale and protect from DDoSing. The money you spend on that could instead be spent on another thousand worthless websites to SEO-spam with. Probably 1 out of 10,000 such sites accidentally generates enough traffic to be worth it.

    • (Score: 2) by mcgrew on Tuesday July 09, @06:51PM (1 child)

      by mcgrew (701) <publish@mcgrewbooks.com> on Tuesday July 09, @06:51PM (#1363578) Homepage Journal

      I'd like one to keep AI scrapers off of my sites, but a quick look at TFA didn't give the tool's name. What's it called? Where can I get it? Is it open source? Maybe my host already has one, they have a lot of tools I never use.

      --
      mcgrewbooks.com mcgrew.info nooze.org
      • (Score: 2) by VLM on Wednesday July 10, @12:39PM

        by VLM (445) on Wednesday July 10, @12:39PM (#1363640)

        I wonder how effective a robots.txt file would be against people with a financial incentive to ignore it.

        https://netfuture.ch/2023/07/blocking-ai-crawlers-robots-txt-chatgpt/#Example-robotstxt [netfuture.ch]

        My guess is that if a webhost combines its logs and sees some IP address scan every linked page at site A in a few seconds, then site B, then site C, it can write a rule to block that address across the entire webhost, covering all the other 9997 sites it hosts.

        I wonder if various countermeasures will eventually result in outsourced botnets being used to gather AI data... Another novel workaround... how many people would accept $1/day for "big company" to monitor their entire internet browser access? Remember, numerous big companies and governments already do this and don't pay the user $1/day. Just a novel idea.
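
For anyone curious what a robots.txt rule does for a crawler that chooses to honor it, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text, the `is_allowed` helper, and the example.com URL are made up for illustration; GPTBot and CCBot are user-agent names published by their operators. This only models a well-behaved crawler, and does nothing against one that ignores the file, which is the concern raised above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block two self-identified AI crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/story"  # placeholder URL
    for agent in ("GPTBot", "CCBot", "Mozilla/5.0"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

The check is purely advisory: the crawler runs it on its own side, so enforcement still needs something like the log-based blocking described in the comment above.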

  • (Score: 1, Interesting) by Anonymous Coward on Monday July 08, @10:31PM (5 children)

    by Anonymous Coward on Monday July 08, @10:31PM (#1363497)

    I occasionally wander into their domain via click bait.

    I've never made it to the end of the story. I end up googling why I need to place a cup under the toilet seat.

    Is all that stuff AI-generated on the fly just to see how long we will keep clicking the correct button, as clicking the incorrect button sends us into another maze of ads?

    It's like stepping into fresh dog poo...internet style.

    • (Score: 1, Insightful) by Anonymous Coward on Monday July 08, @10:49PM (1 child)

      by Anonymous Coward on Monday July 08, @10:49PM (#1363501)

      uBlock Origin has prevented the following page from loading:

      http://taboola.com/ [taboola.com]

      Because of the following filter:

      ||taboola.com^

      uBlock Origin has prevented the following page from loading:

      http://outbrain.com/ [outbrain.com]

      Because of the following filter:

      ||outbrain.com^
      Found in:

      Peter Lowe’s Ad and tracking server list

      If you choose to wander the web naked, be prepared to suffer the consequences. Seriously, you wouldn't go out picking blackberries with no clothes, would you?

      • (Score: 2) by Freeman on Tuesday July 09, @02:17PM

        by Freeman (732) on Tuesday July 09, @02:17PM (#1363545) Journal

        Picking blackberries naked is a lot safer.

        --
        Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
    • (Score: 2) by Freeman on Tuesday July 09, @02:21PM (2 children)

      by Freeman (732) on Tuesday July 09, @02:21PM (#1363546) Journal

      I ended up on one "never ending story" well before the AI hypetrain hit. Generative AI may help feed it, but it's always been horrible. I figure, if it's baity, I didn't need to know about it in the first place. If the "this one simple trick" / "make $$$$$/wk" gimmicks, etc. were true, then they wouldn't be behind some click-baity headline. I ignore 99% of the "news" and my sanity thanks me.

      --
      Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
      • (Score: 2) by mcgrew on Tuesday July 09, @06:54PM (1 child)

        by mcgrew (701) <publish@mcgrewbooks.com> on Tuesday July 09, @06:54PM (#1363579) Homepage Journal

        I ignore 99% of the "news"

        Getting your news from social media is foolish to the point of insanity, but apparently, according to the number of Snopes articles about clickbait, PT Barnum was right.

        --
        mcgrewbooks.com mcgrew.info nooze.org
        • (Score: 2) by Freeman on Wednesday July 10, @06:29PM

          by Freeman (732) on Wednesday July 10, @06:29PM (#1363671) Journal

          SoylentNews is essentially the only "Social Media" that I use. In the event that the "News" is not newsworthy enough for everybody to be talking about it, then it's probably meant to manipulate you in some way or fashion anyway. Mainstream news isn't what it used to be and I've essentially given up on Television in general. It's probably been 20 years since I regularly watched Over-the-air TV. I also only paid for Satellite TV under duress for a short period of time.

          I do occasionally visit the likes of Arstechnica as well as SoylentNews.

          --
          Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"