
SoylentNews is people

posted by janrinok on Thursday April 20 2023, @04:28PM
from the I'm-not-pirating-this-movie-I'm-training-my-AI-model dept.

Inside the secret list of websites that make AI like ChatGPT sound smart:

AI chatbots have exploded in popularity over the past four months, stunning the public with their awesome abilities, from writing sophisticated term papers to holding unnervingly lucid conversations.

Chatbots cannot think like humans: They do not actually understand what they say. They can mimic human speech because the artificial intelligence that powers them has ingested a gargantuan amount of text, mostly scraped from the internet.

This text is the AI's main source of information about the world as it is being built, and it influences how it responds to users. If it aces the bar exam, for example, it's probably because its training data included thousands of LSAT practice sites.

Tech companies have grown secretive about what they feed the AI. So The Washington Post set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI's training data.

To look inside this black box, we analyzed Google's C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google's T5 and Facebook's LLaMA. (OpenAI does not disclose what data sets it uses to train the models backing its popular chatbot, ChatGPT.)

The Post worked with researchers at the Allen Institute for AI on this investigation and categorized the websites using data from Similarweb, a web analytics company. About a third of the websites could not be categorized, mostly because they no longer appear on the internet. Those are not shown.

We then ranked the remaining 10 million websites based on how many "tokens" appeared from each in the data set. Tokens are small bits of text used to process disorganized information — typically a word or phrase.
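The ranking step described above can be sketched in a few lines of Python. This is only an illustration, not the method The Post or the Allen Institute used: the function name, the sample URLs and texts, and the whitespace-based token count are all assumptions, and a real analysis would use the model's own tokenizer and would likely group subdomains differently.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_domains_by_tokens(documents):
    """Return (domain, token_count) pairs, largest count first.

    `documents` is an iterable of (url, text) pairs. Splitting on
    whitespace is a crude stand-in for a real tokenizer (assumption).
    """
    totals = Counter()
    for url, text in documents:
        domain = urlparse(url).netloc.lower()
        # Treat "www.example.org" and "example.org" as the same site.
        if domain.startswith("www."):
            domain = domain[4:]
        totals[domain] += len(text.split())
    return totals.most_common()


# Hypothetical sample pages, invented for illustration only.
sample = [
    ("https://patents.google.com/patent/A", "a method for sorting data quickly"),
    ("https://www.wikipedia.org/wiki/B", "an encyclopedia entry"),
    ("https://patents.google.com/patent/C", "another claim"),
]
ranking = rank_domains_by_tokens(sample)
for domain, count in ranking:
    print(domain, count)
```

Summing per domain before sorting is what lets a site with many short pages outrank a site with a few long ones.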

The data set was dominated by websites from industries including journalism, entertainment, software development, medicine and content creation, helping to explain why these fields may be threatened by the new wave of artificial intelligence. The three biggest sites were patents.google.com (No. 1), which contains text from patents issued around the world; wikipedia.org (No. 2), the free online encyclopedia; and scribd.com (No. 3), a subscription-only digital library. Also high on the list: b-ok.org (No. 190), a notorious market for pirated e-books that has since been seized by the U.S. Justice Department. At least 27 other sites identified by the U.S. government as markets for piracy and counterfeits were present in the data set.

[...] Others raised significant privacy concerns. Two sites in the top 100, coloradovoters.info (No. 40) and flvoters.com (No. 73), had privately hosted copies of state voter registration databases. Though voter data is public, the models could use this personal information in unknown ways.

[...] The Post's analysis suggests more legal challenges may be on the way: The copyright symbol — which denotes a work registered as intellectual property — appears more than 200 million times in the C4 data set.

The News and Media category ranks third across categories. But half of the top 10 sites overall were news outlets: nytimes.com (No. 4), latimes.com (No. 6), theguardian.com (No. 7), forbes.com (No. 8), and huffpost.com (No. 9). (Washingtonpost.com, No. 11, was close behind.) Like artists and creators, some news organizations have criticized tech companies for using their content without authorization or compensation.

[...] Technology is the second largest category, making up 15 percent of categorized tokens. This includes many platforms for building websites, like sites.google.com (No. 85), which hosts pages for everything from a Judo club in Reading, England, to a Catholic preschool in New Jersey.

The data set contained more than half a million personal blogs, representing 3.8 percent of categorized tokens. Publishing platform medium.com (No. 46) was the fifth largest technology site and hosts tens of thousands of blogs under its domain. Our tally includes blogs written on platforms like WordPress, Tumblr, Blogspot and LiveJournal.

[...] Social networks like Facebook and Twitter — the heart of the modern web — prohibit scraping, which means most data sets used to train AI cannot access them. Tech giants like Facebook and Google that are sitting on mammoth troves of conversational data have not been clear about how personal user information may be used to train AI models that are used internally or sold as products.

[...] A web crawl may sound like a copy of the entire internet, but it's just a snapshot, capturing content from a sampling of webpages at a particular moment in time. C4 began as a scrape performed in April 2019 by the nonprofit CommonCrawl, a popular resource for AI models. CommonCrawl told The Post that it tries to prioritize the most important and reputable sites, but does not try to avoid licensed or copyrighted content.

[...] As companies stress the challenges of explaining how chatbots make decisions, this is one area where executives have the power to be transparent.


This discussion was created by janrinok (52) for logged-in users only, but now has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 3, Funny) by darkfeline on Thursday April 20 2023, @07:01PM (7 children)

    by darkfeline (1030) on Thursday April 20 2023, @07:01PM (#1302280) Homepage

    You know, it's curious. Humans also mimic human speech (writing) by ingesting a lot of text. It's almost like humans are powered by a neural net, but as we all know humans actually have souls.

    --
    Join the SDF Public Access UNIX System today!
  • (Score: 3, Informative) by vux984 on Thursday April 20 2023, @07:14PM (6 children)

    by vux984 (5045) on Thursday April 20 2023, @07:14PM (#1302282)

    I don't think we have "souls", but we are more than just a neural net creating text that looks like human speech.

    • (Score: 0) by Anonymous Coward on Thursday April 20 2023, @08:58PM (4 children)

      by Anonymous Coward on Thursday April 20 2023, @08:58PM (#1302297)

      ...and we're more than the neural net in a Tesla that can't tell when there is a stopped emergency vehicle blocking the lane (at least most of us humans are, maybe I should remove drunk drivers from the list).

      • (Score: 3, Funny) by DannyB on Thursday April 20 2023, @09:05PM (3 children)

        by DannyB (5839) Subscriber Badge on Thursday April 20 2023, @09:05PM (#1302298) Journal

        After enough collisions with stopped emergency vehicles, one would think Tesla's neural net would learn these things.

        It is insanity to keep doing while(true) { do something }, and expect different results.

        --
        A WHERE clause on a SQL UPDATE statement is just adding unnecessary complexity to something simple.
        • (Score: 2) by Freeman on Friday April 21 2023, @01:42PM (2 children)

          by Freeman (732) on Friday April 21 2023, @01:42PM (#1302396) Journal

          But how will they actually go if they decide that all immobile objects should be avoided? They just need to invent matter phasing, so they can zoom right through the obstacle without hitting anything. Much more realistic goal.

          --
          Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
          • (Score: 2) by DannyB on Friday April 21 2023, @02:12PM (1 child)

            by DannyB (5839) Subscriber Badge on Friday April 21 2023, @02:12PM (#1302401) Journal

            With sufficient speed a vehicle might be able to zoom right through an obstacle, similar to a bullet.

            --
            A WHERE clause on a SQL UPDATE statement is just adding unnecessary complexity to something simple.
            • (Score: 3, Interesting) by Freeman on Monday April 24 2023, @09:25PM

              by Freeman (732) on Monday April 24 2023, @09:25PM (#1302892) Journal

              Closest empirical evidence I've found is Mythbusters using a rocket sled to nearly vaporize a vehicle. I mean, it certainly didn't phase through anything, but I'm quite sure the theoretical speed necessary would be much higher than the rocket sled could get to.

              --
              Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
    • (Score: 0) by Anonymous Coward on Friday April 21 2023, @05:24PM

      by Anonymous Coward on Friday April 21 2023, @05:24PM (#1302439)
      Where does that consciousness thing come from? Assuming you experience it too...