
from the I'm-not-pirating-this-movie-I'm-training-my-AI-model dept.
Inside the secret list of websites that make AI like ChatGPT sound smart:
AI chatbots have exploded in popularity over the past four months, stunning the public with their awesome abilities, from writing sophisticated term papers to holding unnervingly lucid conversations.
Chatbots cannot think like humans: They do not actually understand what they say. They can mimic human speech because the artificial intelligence that powers them has ingested a gargantuan amount of text, mostly scraped from the internet.
This text is the AI's main source of information about the world as it is being built, and it influences how it responds to users. If it aces the bar exam, for example, it's probably because its training data included thousands of LSAT practice sites.
Tech companies have grown secretive about what they feed the AI. So The Washington Post set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI's training data.
To look inside this black box, we analyzed Google's C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google's T5 and Facebook's LLaMA. (OpenAI does not disclose what datasets it uses to train the models backing its popular chatbot, ChatGPT.)
The Post worked with researchers at the Allen Institute for AI on this investigation and categorized the websites using data from Similarweb, a web analytics company. About a third of the websites could not be categorized, mostly because they no longer appear on the internet. Those are not shown.
We then ranked the remaining 10 million websites based on how many "tokens" appeared from each in the data set. Tokens are small bits of text used to process disorganized information — typically a word or phrase.
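The ranking step described above can be illustrated with a minimal sketch. It is not the method The Post or the Allen Institute actually used. It assumes each document is a (URL, text) pair, uses plain whitespace splitting as a stand-in for a real tokenizer, and relies on made-up example domains and the hypothetical helper names `count_tokens` and `rank_sites_by_tokens`.

```python
from collections import Counter
from urllib.parse import urlparse


def count_tokens(text):
    """Count tokens in a piece of text.

    Assumption: whitespace splitting stands in for a real tokenizer,
    which would usually break text into smaller sub-word pieces.
    """
    return len(text.split())


def rank_sites_by_tokens(documents):
    """Return (site, token_count) pairs, largest token count first.

    `documents` is an iterable of (url, text) pairs. Pages are grouped
    by host name, with a leading "www." removed so that both spellings
    of a site are counted together.
    """
    totals = Counter()
    for url, text in documents:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        totals[host] += count_tokens(text)
    return totals.most_common()


# Made-up sample data; the domains below are placeholders, not real sites.
sample_documents = [
    ("https://www.example-patents.test/a", "a method for folding paper into a crane"),
    ("https://example-patents.test/b", "an apparatus for mixing paint"),
    ("https://example-encyclopedia.test/wiki/Crane", "cranes are large birds"),
    ("https://example-blog.test/post", "my weekend hiking trip went well"),
]

ranking = rank_sites_by_tokens(sample_documents)
for position, (site, tokens) in enumerate(ranking, start=1):
    print(f"No. {position}: {site} ({tokens} tokens)")
```

In this sketch a site's rank depends only on how much text it contributed, not on how many pages it has, which is why a text-heavy source such as a patent archive can outrank sites with far more visitors.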
The data set was dominated by websites from industries including journalism, entertainment, software development, medicine and content creation, helping to explain why these fields may be threatened by the new wave of artificial intelligence. The three biggest sites were patents.google.com No. 1, which contains text from patents issued around the world; wikipedia.org No. 2, the free online encyclopedia; and scribd.com No. 3, a subscription-only digital library. Also high on the list: b-ok.org No. 190, a notorious market for pirated e-books that has since been seized by the U.S. Justice Department. At least 27 other sites identified by the U.S. government as markets for piracy and counterfeits were present in the data set.
[...] Others raised significant privacy concerns. Two sites in the top 100, coloradovoters.info No. 40 and flvoters.com No. 73, had privately hosted copies of state voter registration databases. Though voter data is public, the models could use this personal information in unknown ways.
[...] The Post's analysis suggests more legal challenges may be on the way: The copyright symbol — which denotes a work registered as intellectual property — appears more than 200 million times in the C4 data set.
The News and Media category ranks third across categories. But half of the top 10 sites overall were news outlets: nytimes.com No. 4, latimes.com No. 6, theguardian.com No. 7, forbes.com No. 8, and huffpost.com No. 9. (Washingtonpost.com No. 11 was close behind.) Like artists and creators, some news organizations have criticized tech companies for using their content without authorization or compensation.
[...] Technology is the second largest category, making up 15 percent of categorized tokens. This includes many platforms for building websites, like sites.google.com No. 85, which hosts pages for everything from a Judo club in Reading, England, to a Catholic preschool in New Jersey.
The data set contained more than half a million personal blogs, representing 3.8 percent of categorized tokens. Publishing platform medium.com No. 46 was the fifth largest technology site and hosts tens of thousands of blogs under its domain. Our tally includes blogs written on platforms like WordPress, Tumblr, Blogspot and Live Journal.
[...] Social networks like Facebook and Twitter — the heart of the modern web — prohibit scraping, which means most data sets used to train AI cannot access them. Tech giants like Facebook and Google that are sitting on mammoth troves of conversational data have not been clear about how personal user information may be used to train AI models that are used internally or sold as products.
[...] A web crawl may sound like a copy of the entire internet, but it's just a snapshot, capturing content from a sampling of webpages at a particular moment in time. C4 began as a scrape performed in April 2019 by the nonprofit CommonCrawl, a popular resource for AI models. CommonCrawl told The Post that it tries to prioritize the most important and reputable sites, but does not try to avoid licensed or copyrighted content.
[...] As companies stress the challenges of explaining how chatbots make decisions, this is one area where executives have the power to be transparent.
Related Stories
A proposed set of rules by the European Union would, among other things, require makers of generative AI tools such as ChatGPT to publicize any copyrighted material used by the technology platforms to create content of any kind.
A new draft of the European Parliament's legislation, a copy of which was obtained by The Wall Street Journal, would allow the original creators of content used by generative AI applications to share in any profits that result.
The European Union's "Artificial Intelligence Act" (AI Act) is the first of its kind by a western set of nations. The proposed legislation relies heavily on existing rules, such as the General Data Protection Regulation (GDPR), the Digital Services Act, and the Digital Markets Act. The AI Act was originally proposed by the European Commission in April 2021.
The bill's provisions also require that the large language models (LLMs) behind generative AI tech, such as GPT-4, be designed with adequate safeguards against generating content that violates EU laws; that could include child pornography or, in some EU countries, denial of the Holocaust, according to The Washington Post.
[...] But the solution to keeping AI honest isn't easy, according to Avivah Litan, a vice president and distinguished analyst at Gartner Research. It's likely that LLM creators, such as San Francisco-based OpenAI and others, will need to develop powerful LLMs to check that the ones trained initially have no copyrighted materials. Rules-based systems to filter out copyrighted materials are likely to be ineffective, Litan said.
They were asked about it, and they deleted everything:
There was nothing in Drew Ortiz's author biography at Sports Illustrated to suggest that he was anything other than human.
"Drew has spent much of his life outdoors, and is excited to guide you through his never-ending list of the best products to keep you from falling to the perils of nature," it read. "Nowadays, there is rarely a weekend that goes by where Drew isn't out camping, hiking, or just back on his parents' farm."
The only problem? Outside of Sports Illustrated, Drew Ortiz doesn't seem to exist. He has no social media presence and no publishing history. And even more strangely, his profile photo on Sports Illustrated is for sale on a website that sells AI-generated headshots, where he's described as "neutral white young-adult male with short brown hair and blue eyes."
Ortiz isn't the only AI-generated author published by Sports Illustrated, according to a person involved with the creation of the content who asked to be kept anonymous to protect them from professional repercussions.
"There's a lot," they told us of the fake authors. "I was like, what are they? This is ridiculous. This person does not exist."
[...] The AI content marks a staggering fall from grace for Sports Illustrated, which in past decades won numerous National Magazine Awards for its sports journalism and published work by literary giants ranging from William Faulkner to John Updike.
But now that it's under the management of The Arena Group, parts of the magazine seem to have devolved into a Potemkin Village in which phony writers are cooked up out of thin air, outfitted with equally bogus biographies and expertise to win readers' trust, and used to pump out AI-generated buying guides that are monetized by affiliate links to products that provide a financial kickback when readers click them.
What's next? Six-fingered AI-generated models for the swimsuit edition?
Related:
- The AI Hype Bubble is the New Crypto Hype Bubble
- OpenAI Has Released the Largest Version Yet of its Fake-News-Spewing AI
- Inside the Secret List of Websites That Make AI Like ChatGPT Sound Smart
(Score: 3, Funny) by darkfeline on Thursday April 20 2023, @07:01PM (7 children)
You know, it's curious. Humans also mimic human speech (writing) by ingesting a lot of text. It's almost like humans are powered by a neural net, but as we all know humans actually have souls.
Join the SDF Public Access UNIX System today!
(Score: 3, Informative) by vux984 on Thursday April 20 2023, @07:14PM (6 children)
I don't think we have "souls", but we are more than just a neural net creating text that looks like human speech.
(Score: 0) by Anonymous Coward on Thursday April 20 2023, @08:58PM (4 children)
...and we're more than the neural net in a Tesla that can't tell when there is a stopped emergency vehicle blocking the lane (at least most of us humans are, maybe I should remove drunk drivers from the list).
(Score: 3, Funny) by DannyB on Thursday April 20 2023, @09:05PM (3 children)
After enough collisions with stopped emergency vehicles, one would think Tesla's neural net would learn these things.
It is insanity to keep doing while(true) { do something }, and expect different results.
The Centauri traded Earth jump gate technology in exchange for our superior hair mousse formulas.
(Score: 2) by Freeman on Friday April 21 2023, @01:42PM (2 children)
But how will they actually go if they decide that all immobile objects should be avoided? They just need to invent matter phasing, so they can zoom right through the obstacle without hitting anything. Much more realistic goal.
Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
(Score: 2) by DannyB on Friday April 21 2023, @02:12PM (1 child)
With sufficient speed a vehicle might be able to zoom right through an obstacle, similar to a bullet.
The Centauri traded Earth jump gate technology in exchange for our superior hair mousse formulas.
(Score: 3, Interesting) by Freeman on Monday April 24 2023, @09:25PM
Closest empirical evidence I've found is Mythbusters using a rocket sled to nearly vaporize a vehicle. I mean, it certainly didn't phase through anything, but I'm quite sure the theoretical speed necessary would be much higher than the rocket sled could get to.
Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
(Score: 0) by Anonymous Coward on Friday April 21 2023, @05:24PM
(Score: 2, Funny) by Gaaark on Thursday April 20 2023, @11:55PM (1 child)
Betting Fox news is one of them....
--- Please remind me if I haven't been civil to you: I'm channeling MDC. I have always been here. ---Gaaark 2.0 --
(Score: 2) by DeathMonkey on Friday April 21 2023, @07:48PM
Can't be, it says they sound smart.
(Score: 0) by Anonymous Coward on Friday April 21 2023, @05:58PM
These are probably from news sources. "News" tends to be a caricature of reality (you normally don't report/publicize the boring stuff). So "warts", flaws and other features tend to be exaggerated BUT generally the caricatures have some resemblance to reality.
And the reality is Muhammad waged wars ( https://en.wikipedia.org/wiki/Military_career_of_Muhammad [wikipedia.org] ) and the Quran says that he is an excellent pattern (to emulate/follow): https://corpus.quran.com/translation.jsp?chapter=33&verse=21 [quran.com]
Islam is not a "turn the other cheek" religion. And plenty of Muslims like it that way.
So while genuine Muslims (those who follow Muhammad's example) certainly don't commit violent actions 66 percent of the time, they're more likely to commit violence than genuine Christians (those who follow Jesus's example) or Buddhists (those who follow the examples and teachings of the Buddhas).
If Muslims feel they are being persecuted, far fewer of them are going to do "Buddhist monk style self immolations"[1] or do the "bless those who curse us and pray for those who ill-treat us"[2]. There's a higher chance of them going to war and thus appearing in news articles committing violent acts.
Maybe the authors have their own bias?
See also: https://wikiislam.net/wiki/List_of_Killings_Ordered_or_Supported_by_Muhammad#List_of_Killings [wikiislam.net]
[1] https://theconversation.com/understanding-self-immolation-in-buddhism-after-wynn-bruces-earth-day-action-182007 [theconversation.com]
[2] https://scriptureunion.org/dailydiscovery/a-prayer-of-the-persecuted/ [scriptureunion.org]