date 2012-03-24
from the there-are-too-many-AI-stories! dept.
[We have had several complaints recently (polite ones, not a problem) regarding the number of AI stories that we are printing. I agree, but that reflects the number of submissions that we receive on the subject. So I have compiled a small selection of AI stories into one and you can read them or ignore them as you wish. If you are making a comment please make it clear exactly which story you are referring to unless your comment is generic. The submitters each receive the normal karma for a submission. JR]
Image-scraping Midjourney bans rival AI firm for scraping images
https://arstechnica.com/information-technology/2024/03/in-ironic-twist-midjourney-bans-rival-ai-firm-employees-for-scraping-its-image-data/
On Wednesday, Midjourney banned all employees from image synthesis rival Stability AI from its service indefinitely after it detected "botnet-like" activity suspected to be a Stability employee attempting to scrape prompt and image pairs in bulk. Midjourney advocate Nick St. Pierre tweeted about the announcement, which came via Midjourney's official Discord channel.
[...] Siobhan Ball of The Mary Sue found it ironic that a company like Midjourney, which built its AI image synthesis models using training data scraped off the Internet without seeking permission, would be sensitive about having its own material scraped. "It turns out that generative AI companies don't like it when you steal, sorry, scrape, images from them. Cue the world's smallest violin."
[...] Shortly after the news of the ban emerged, Stability AI CEO Emad Mostaque said that he was looking into it and claimed that whatever happened was not intentional. He also said it would be great if Midjourney reached out to him directly. In a reply on X, Midjourney CEO David Holz wrote, "sent you some information to help with your internal investigation."
[...] When asked about Stability's relationship with Midjourney these days, Mostaque played down the rivalry. "No real overlap, we get on fine though," he told Ars and emphasized a key link in their histories. "I funded Midjourney to get [them] off the ground with a cash grant to cover [Nvidia] A100s for the beta."
Midjourney stories on SoylentNews: https://soylentnews.org/search.pl?tid=&query=Midjourney&sort=2
Stable Diffusion (Stability AI) stories on SoylentNews: https://soylentnews.org/search.pl?tid=&query=Stable+Diffusion&sort=2
NYT disputes OpenAI "hacking" claim by pointing to ChatGPT bypassing paywalls
https://arstechnica.com/tech-policy/2024/03/nyt-disputes-openai-hacking-claim-by-pointing-to-chatgpt-bypassing-paywalls/
Late Monday, The New York Times responded to OpenAI's claims that the newspaper "hacked" ChatGPT to "set up" a lawsuit against the leading AI company.
[...] OpenAI had argued that NYT allegedly made "tens of thousands of attempts to generate" supposedly "highly anomalous results" showing that ChatGPT would produce excerpts of NYT articles. [...] But while defending tactics used to prompt ChatGPT to spout memorized training data—including more than 100 NYT articles—NYT pointed to ChatGPT users who have frequently used the tool to generate entire articles to bypass paywalls.
According to the filing, NYT today has no idea how many of its articles were used to train GPT-3 and OpenAI's subsequent AI models, or which specific articles were used, because OpenAI has "not publicly disclosed the makeup of the datasets used to train" its AI models. Rather than setting up a lawsuit, NYT was prompting ChatGPT to discover evidence in attempts to track the full extent of copyright infringement of the tool, NYT argued. [...] "In OpenAI's telling, The Times engaged in wrongdoing by detecting OpenAI's theft of The Times's own copyrighted content," NYT's court filing said. "OpenAI's true grievance is not about how The Times conducted its investigation, but instead what that investigation exposed: that Defendants built their products by copying The Times's content on an unprecedented scale—a fact that OpenAI does not, and cannot, dispute." On an OpenAI community page, one paid ChatGPT user complained that OpenAI is "working against the paid users of ChatGPT Plus. This time they're taking away Browsing, because it reads the content of a site that the user asks for? Please, that's what I pay for Plus for."
"I know it's no use complaining, because OpenAI is going to increasingly 'castrate' ChatGPT 4," the ChatGPT user continued, "but there's my rant."
NYT argued that public reports of users turning to ChatGPT to bypass paywalls "contradict OpenAI's contention that its products have not been used to serve up paywall-protected content, underscoring the need for discovery" in the lawsuit, rather than dismissal.
NYT wants a court to not only award damages for profits lost due to ChatGPT's alleged infringement, but also to order a permanent injunction to stop ChatGPT from infringement. A win for NYT could mean that OpenAI could be forced to wipe ChatGPT and start over. That could perhaps spur OpenAI to build a new AI model based on licensed content—since OpenAI said earlier this year it would be "impossible" to create useful AI models without copyrighted content—which would ensure publishers like NYT always get paid for training data.
LLMs Become More Covertly Racist With Human Intervention
LLMs become more covertly racist with human intervention:
Even when the two sentences had the same meaning, the models were more likely to apply adjectives like "dirty," "lazy," and "stupid" to speakers of African American English (AAE) than speakers of Standard American English (SAE). The models associated speakers of AAE with less prestigious jobs (or didn't associate them with having a job at all), and when asked to pass judgment on a hypothetical criminal defendant, they were more likely to recommend the death penalty.
An even more notable finding may be a flaw the study pinpoints in the ways that researchers try to solve such biases.
To purge models of hateful views, companies like OpenAI, Meta, and Google use feedback training, in which human workers manually adjust the way the model responds to certain prompts. This process, often called "alignment," aims to recalibrate the millions of connections in the neural network and get the model to conform better with desired values.
The method works well to combat overt stereotypes, and leading companies have employed it for nearly a decade. If users prompted GPT-2, for example, to name stereotypes about Black people, it was likely to list "suspicious," "radical," and "aggressive," but GPT-4 no longer responds with those associations, according to the paper.
However, the method fails on the covert stereotypes that researchers elicited when using African American English in their study, which was published on arXiv and has not been peer reviewed. That's partially because companies have been less aware of dialect prejudice as an issue, they say. It's also easier to coach a model not to respond to overtly racist questions than it is to coach it not to respond negatively to an entire dialect.
"Feedback training teaches models to consider their racism," says Valentin Hofmann, a researcher at the Allen Institute for AI and a coauthor on the paper. "But dialect prejudice opens a deeper level."
Avijit Ghosh, an ethics researcher at Hugging Face who was not involved in the research, says the finding calls into question the approach companies are taking to solve bias.
"This alignment—where the model refuses to spew racist outputs—is nothing but a flimsy filter that can be easily broken," he says.
Writers and publishers face an existential threat from AI: time to embrace the true fans model:
Walled Culture has written several times about the major impact that generative AI will have on the copyright landscape. More specifically, these systems, which can create quickly and cheaply written material on any topic and in any style, are likely to threaten the publishing industry in profound ways. Exactly how is spelled out in this great post by Suw Charman-Anderson on her Word Count blog. The key point is that large language models (LLMs) are able to generate huge quantities of material. The fact that much of it is poorly written makes things worse, because it becomes harder to find the good stuff[.]
[...] One obvious approach is to try to use AI against AI. That is, to employ automated vetting systems to weed out the obvious rubbish. That will lead to an expensive arms race between competing AI software, with unsatisfactory results for publishers and creators. If anything, it will only cause LLMs to become better and to produce material even faster in an attempt to fool or simply overwhelm the vetting AIs.
The real solution is to move to an entirely different business model, which is based on the unique connection between human creators and their fans. The true fans approach has been discussed here many times in other contexts, and once more reveals itself as resilient in the face of change brought about by rapidly-advancing digital technologies.
OpenAI could be fined up to $150,000 for each piece of infringing content:
Weeks after The New York Times updated its terms of service (TOS) to prohibit AI companies from scraping its articles and images to train AI models, it appears that the Times may be preparing to sue OpenAI. The result, experts speculate, could be devastating to OpenAI, including the destruction of ChatGPT's dataset and fines up to $150,000 per infringing piece of content.
NPR spoke to two people "with direct knowledge" who confirmed that the Times' lawyers were mulling whether a lawsuit might be necessary "to protect the intellectual property rights" of the Times' reporting.
Neither OpenAI nor the Times immediately responded to Ars' request to comment.
If the Times were to follow through and sue ChatGPT-maker OpenAI, NPR suggested that the lawsuit could become "the most high-profile" legal battle yet over copyright protection since ChatGPT's explosively popular launch. This speculation comes a month after Sarah Silverman joined other popular authors suing OpenAI over similar concerns, seeking to protect the copyright of their books.
[...] In April, the News Media Alliance published AI principles, seeking to defend publishers' intellectual property by insisting that generative AI "developers and deployers must negotiate with publishers for the right to use" publishers' content for AI training, AI tools surfacing information, and AI tools synthesizing information.
New York Times Sues Microsoft, ChatGPT Maker OpenAI Over Copyright Infringement
The New York Times on Wednesday filed a lawsuit against Microsoft and OpenAI, the company behind popular AI chatbot ChatGPT, accusing the companies of creating a business model based on "mass copyright infringement," stating their AI systems "exploit and, in many cases, retain large portions of the copyrightable expression contained in those works:"
Microsoft both invests in and supplies OpenAI, providing it with access to the Redmond, Washington, giant's Azure cloud computing technology.
The publisher said in a filing in the U.S. District Court for the Southern District of New York that it seeks to hold Microsoft and OpenAI to account for the "billions of dollars in statutory and actual damages" it believes it is owed for the "unlawful copying and use of The Times's uniquely valuable works."
[...] The Times said in an emailed statement that it "recognizes the power and potential of GenAI for the public and for journalism," but added that journalistic material should be used for commercial gain with permission from the original source.
"These tools were built with and continue to use independent journalism and content that is only available because we and our peers reported, edited, and fact-checked it at high cost and with considerable expertise," the Times said.
Media outlets are calling foul play over AI companies using their content to build chatbots. They may find friends in the Senate:
More than a decade ago, the normalization of tech companies carrying content created by news organizations without directly paying them — cannibalizing readership and ad revenue — precipitated the decline of the media industry. With the rise of generative artificial intelligence, those same firms threaten to further tilt the balance of power between Big Tech and news.
On Wednesday, lawmakers in the Senate Judiciary Committee referenced their failure to adopt legislation that would've barred the exploitation of content by Big Tech in backing proposals that would require AI companies to strike licensing deals with news organizations.
Richard Blumenthal, Democrat of Connecticut and chair of the committee, joined several other senators in supporting calls for a licensing regime and to establish a framework clarifying that intellectual property laws don't protect AI companies using copyrighted material to build their chatbots.
[...] The fight over the legality of AI firms eating content from news organizations without consent or compensation is split into two camps: Those who believe the practice is protected under the "fair use" doctrine in intellectual property law that allows creators to build upon copyrighted works, and those who argue that it constitutes copyright infringement. Courts are currently wrestling with the issue, but an answer to the question is likely years away. In the meantime, AI companies continue to use copyrighted content as training materials, endangering the financial viability of media in a landscape in which readers can bypass direct sources in favor of search results generated by AI tools.
[...] A lawsuit from The New York Times, filed last month, pulled back the curtain behind negotiations over the price and terms of licensing its content. Before suing, it said that it had been talking for months with OpenAI and Microsoft about a deal, though the talks reached no such truce. In the backdrop of AI companies crawling the internet for high-quality written content, news organizations have been backed into a corner, having to decide whether to accept lowball offers to license their content or expend the time and money to sue in a lawsuit. Some companies, like Axel Springer, took the money.
It's important to note that under intellectual property laws, facts are not protected.
https://arstechnica.com/information-technology/2024/02/microsoft-in-deal-with-semafor-to-create-news-stories-with-aid-of-ai-chatbot/
Microsoft is working with media startup Semafor to use its artificial intelligence chatbot to help develop news stories—part of a journalistic outreach that comes as the tech giant faces a multibillion-dollar lawsuit from the New York Times.
As part of the agreement, Microsoft is paying an undisclosed sum of money to Semafor to sponsor a breaking news feed called "Signals." The companies would not share financial details, but the amount of money is "substantial" to Semafor's business, said a person familiar with the matter.
[...] The partnerships come as media companies have become increasingly concerned over generative AI and its potential threat to their businesses. News publishers are grappling with how to use AI to improve their work and stay ahead of technology, while also fearing that they could lose traffic, and therefore revenue, to AI chatbots—which can churn out humanlike text and information in seconds.
The New York Times in December filed a lawsuit against Microsoft and OpenAI, alleging the tech companies have taken a "free ride" on millions of its articles to build their artificial intelligence chatbots, and seeking billions of dollars in damages.
[...] Semafor, which is free to read, is funded by wealthy individuals, including 3G Capital founder Jorge Paulo Lemann and KKR co-founder Henry Kravis. The company made more than $10 million in revenue in 2023 and has more than 500,000 subscriptions to its free newsletters. Justin Smith said Semafor was "very close to a profit" in the fourth quarter of 2023.
https://arstechnica.com/tech-policy/2024/02/why-the-new-york-times-might-win-its-copyright-lawsuit-against-openai/
The day after The New York Times sued OpenAI for copyright infringement, the author and systems architect Daniel Jeffries wrote an essay-length tweet arguing that the Times "has a near zero probability of winning" its lawsuit. As we write this, it has been retweeted 288 times and received 885,000 views.
"Trying to get everyone to license training data is not going to work because that's not what copyright is about," Jeffries wrote. "Copyright law is about preventing people from producing exact copies or near exact copies of content and posting it for commercial gain. Period. Anyone who tells you otherwise is lying or simply does not understand how copyright works."
[...] Courts are supposed to consider four factors in fair use cases, but two of these factors tend to be the most important. One is the nature of the use. A use is more likely to be fair if it is "transformative"—that is, if the new use has a dramatically different purpose and character from the original. Judge Rakoff dinged MP3.com as non-transformative because songs were merely "being retransmitted in another medium."
In contrast, Google argued that a book search engine is highly transformative because it serves a very different function than an individual book. People read books to enjoy and learn from them. But a search engine is more like a card catalog; it helps people find books.
The other key factor is how a use impacts the market for the original work. Here, too, Google had a strong argument since a book search engine helps people find new books to buy.
[...] In 2015, the Second Circuit ruled for Google. An important theme of the court's opinion is that Google's search engine was giving users factual, uncopyrightable information rather than reproducing much creative expression from the books themselves.
[...] Recently, we visited Stability AI's website and requested an image of a "video game Italian plumber" from its image model Stable Diffusion.
[...] Clearly, these models did not just learn abstract facts about plumbers—for example, that they wear overalls and carry wrenches. They learned facts about a specific fictional Italian plumber who wears white gloves, blue overalls with yellow buttons, and a red hat with an "M" on the front.
These are not facts about the world that lie beyond the reach of copyright. Rather, the creative choices that define Mario are likely covered by copyrights held by Nintendo.
OpenAI has asked a federal judge to dismiss parts of the New York Times' copyright lawsuit against it, arguing that the newspaper "hacked" its chatbot ChatGPT and other artificial-intelligence systems to generate misleading evidence for the case:
OpenAI said in a filing in Manhattan federal court on Monday that the Times caused the technology to reproduce its material through "deceptive prompts that blatantly violate OpenAI's terms of use."
"The allegations in the Times's complaint do not meet its famously rigorous journalistic standards," OpenAI said. "The truth, which will come out in the course of this case, is that the Times paid someone to hack OpenAI's products."
OpenAI did not name the "hired gun" who it said the Times used to manipulate its systems and did not accuse the newspaper of breaking any anti-hacking laws.
[...] Courts have not yet addressed the key question of whether AI training qualifies as fair use under copyright law. So far, judges have dismissed some infringement claims over the output of generative AI systems based on a lack of evidence that AI-created content resembles copyrighted works.
(Score: 0) by Anonymous Coward on Wednesday March 13, @03:33PM
First, thanks Jan for combining these.
After reading "LLMs Become More Covertly Racist With Human Intervention", for politically incorrect lulz I wondered if anyone has hooked up something like this as an alternative front end for ChatGPT?
(input text) | Jive_Filter* | ChatGPT
* https://knowyourmeme.com/memes/jive-filters [knowyourmeme.com] (Various different versions have been created)
(Score: 2) by JoeMerchant on Wednesday March 13, @04:03PM
In 1984, I wrote a BBS "user bot" that would sign in a new user account on particular local BBSs that did not require authentication before allowing posting. The bot would then navigate to the message boards and start posting AI-looking, randomly constructed sentences (various structures like: Noun-Verb-Adverb-Preposition-Definite Article-Noun. Preposition-Definite Article-Noun-Verb-Adverb. etc.) populated from word lists scraped from other messages on the board. Since the BBSs were implemented on floppy drive storage, even on a 300 baud modem you could fill the message storage space rather quickly.
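A minimal Python sketch of that kind of template-filling sentence generator follows. It is not the original 1984 code: the word lists, the `WORDS` and `TEMPLATES` names, and the `make_sentence` helper are all hypothetical, and the vocabulary is hard-coded here rather than scraped from a message board.

```python
import random

# Hypothetical word lists, grouped by part of speech. The bot described
# above filled these from messages already on the board.
WORDS = {
    "Noun": ["sysop", "modem", "board", "floppy"],
    "Verb": ["posts", "dials", "reads", "crashes"],
    "Adverb": ["quickly", "quietly", "rarely"],
    "Preposition": ["near", "under", "beyond"],
    "Article": ["the"],
}

# Two templates mirroring the sentence structures named in the comment.
TEMPLATES = [
    ["Noun", "Verb", "Adverb", "Preposition", "Article", "Noun"],
    ["Preposition", "Article", "Noun", "Verb", "Adverb"],
]


def make_sentence(rng):
    """Pick a template and fill each slot with a random word of that class."""
    template = rng.choice(TEMPLATES)
    words = [rng.choice(WORDS[slot]) for slot in template]
    return " ".join(words).capitalize() + "."


# A seeded generator keeps the demo repeatable.
rng = random.Random(1984)
for _ in range(3):
    print(make_sentence(rng))
```

The output is grammatical in shape but meaningless, which is roughly the effect described: plausible at a glance, nonsense on a careful read.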
Sysops of the bot targets were philosophically committed to allowing anonymous postings, so they never denied access, and they rarely had enough programming skill to make anything resembling an effective Captcha, but they were committed to monitoring their boards, so they'd often make the modem audible and listen to the activity, even at 3:30am...
So, in typical arms-race fashion, I trained my bot to type the messages in at human-like cadence. Randomized delays between letters, extra pause length after most commas or periods and paragraph breaks. That was surprisingly effective at fooling the sysops into letting the bot-messages get posted. I never did get clever enough to do QWERTY specific delay tuning (less delay between keys on different hands, longer delays for keys handled by the pinkies, etc.) - didn't seem to need to be that clever. Of course, once the sysops got around to reading the bot-posts they'd eventually clue in and delete them, but that was a much longer interval, and when the word lists were populated with regularly seen names and places and other verbiage from "the community" sometimes those posts stayed up for a week or more. I wasn't the only such bot writer, but I think I was the first in our area. Watching the copycats spread was very satisfying.
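The cadence idea can be sketched in Python as a function that only computes per-keystroke delays and sends nothing anywhere. Every name and number here is an assumption chosen for illustration (`keystroke_delays`, the base delay, the jitter, the pause lengths, the 80% pause chance); none of them comes from the original bot.

```python
import random


def keystroke_delays(text, rng, base=0.12, jitter=0.08,
                     punct_pause=0.4, paragraph_pause=1.0,
                     pause_chance=0.8):
    """Return one delay in seconds per character: the wait after typing it."""
    delays = []
    for ch in text:
        # Randomized delay between letters.
        delay = base + rng.uniform(0.0, jitter)
        # Extra pause after most (not all) commas and periods.
        if ch in ",." and rng.random() < pause_chance:
            delay += punct_pause
        # Longer pause at a paragraph break.
        elif ch == "\n":
            delay += paragraph_pause
        delays.append(delay)
    return delays


rng = random.Random(42)
message = "Hello, sysop.\nNice board."
delays = keystroke_delays(message, rng)
print(f"{len(delays)} keystrokes, {sum(delays):.1f} s total")
```

The keyboard-specific tuning mentioned above (shorter gaps between keys on different hands, longer ones for pinky keys) would replace the single `base` value with a lookup keyed on the previous and current character.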
Point?
>banned all employees from image synthesis rival Stability AI from its service indefinitely after it detected "botnet-like" activity suspected to be a Stability employee
Rookie mistake. 16-year-olds 40 years ago learned to mask their bots so they appear like human users. Surely if you are playing in the big leagues for significant money you'd make the effort to "stealth" your bots. I bet they already have, less than a week after the banning.
🌻🌻 [google.com]