from the my-voice-is-no-longer-my-password dept.
Text-to-speech model can preserve speaker's emotional tone and acoustic environment:
On Thursday, Microsoft researchers announced a new text-to-speech AI model called VALL-E that can closely simulate a person's voice when given a three-second audio sample. Once it learns a specific voice, VALL-E can synthesize audio of that person saying anything—and do it in a way that attempts to preserve the speaker's emotional tone.
Its creators speculate that VALL-E could be used for high-quality text-to-speech applications, speech editing where a recording of a person could be edited and changed from a text transcript (making them say something they originally didn't), and audio content creation when combined with other generative AI models like GPT-3.
Related Stories
Generative AI, like OpenAI's ChatGPT, could completely revamp how digital content is developed, Nina Schick, adviser, speaker, and A.I. thought leader, told Yahoo Finance Live:
"I think we might reach 90% of online content generated by AI by 2025, so this technology is exponential," she said. "I believe that the majority of digital content is going to start to be produced by AI. You see ChatGPT... but there are a whole plethora of other platforms and applications that are coming up."
The surge of interest in OpenAI's DALL-E and ChatGPT has facilitated a wide-ranging public discussion about AI and its expanding role in our world, particularly generative AI.
[...] Though it's complicated, the extent to which ChatGPT in its current form is a viable Google competitor, there's little doubt of the possibilities. Meanwhile, Microsoft already has invested $1 billion in OpenAI, and there's talk of further investment from the enterprise tech giant, which owns search engine Bing. The company is reportedly looking to invest another $10 billion in OpenAI.
Previously:
- Microsoft's New AI Can Simulate Anyone's Voice With Three Seconds of Audio
- Google Engineer Suspended After Claiming AI Bot Sentient
- OpenAI's New ChatGPT Bot: 10 "Dangerous" Things it's Capable of
Over the past year, generative AI has kicked off a wave of existential dread over potential machine-fueled job loss not seen since the advent of the industrial revolution. On Tuesday, Netflix reinvigorated that fear when it debuted a short film called Dog and Boy that utilizes AI image synthesis to help generate its background artwork.
Directed by Ryotaro Makihara, the three-minute animated short follows the story of a boy and his robotic dog through cheerful times, although the story soon takes a dramatic turn toward the post-apocalyptic. Along the way, it includes lush backgrounds apparently created as a collaboration between man and machine, credited to "AI (+Human)" in the end credit sequence.
[...] Netflix and the production company WIT Studio tapped Japanese AI firm Rinna for assistance with generating the images. They did not announce exactly what type of technology Rinna used to generate the artwork, but the process looks similar to a Stable Diffusion-powered "img2img" process that can take an image and transform it based on a written prompt.
Related:
ChatGPT Can't be Credited as an Author, Says World's Largest Academic Publisher
90% of Online Content Could be 'Generated by AI by 2025,' Expert Says
Getty Images Targets AI Firm For 'Copying' Photos
Controversy Erupts Over Non-consensual AI Mental Health Experiment
Microsoft's New AI Can Simulate Anyone's Voice With Three Seconds of Audio
AI Everything, Everywhere
Microsoft, GitHub, and OpenAI Sued for $9B in Damages Over Piracy
Adobe Stock Begins Selling AI-Generated Artwork
AI Systems Can't Patent Inventions, US Federal Circuit Court Confirms
Last week, Microsoft researchers announced an experimental framework to control robots and drones using the language abilities of ChatGPT, a popular AI language model created by OpenAI. Using natural language commands, ChatGPT can write special code that controls robot movements. A human then views the results and adjusts as necessary until the task gets completed successfully.
The research arrived in a paper titled "ChatGPT for Robotics: Design Principles and Model Abilities," authored by Sai Vemprala, Rogerio Bonatti, Arthur Bucker, and Ashish Kapoor of the Microsoft Autonomous Systems and Robotics Group.
In a demonstration video, Microsoft shows robots—apparently controlled by code written by ChatGPT while following human instructions—using a robot arm to arrange blocks into a Microsoft logo, flying a drone to inspect the contents of a shelf, or finding objects using a robot with vision capabilities.
To get ChatGPT to interface with robotics, the researchers taught ChatGPT a custom robotics API. When given instructions like "pick up the ball," ChatGPT can generate robotics control code just as it would write a poem or complete an essay. After a human inspects and edits the code for accuracy and safety, the human operator can execute the task and evaluate its performance.
In this way, ChatGPT accelerates robotic control programming, but it's not an autonomous system. "We emphasize that the use of ChatGPT for robotics is not a fully automated process," reads the paper, "but rather acts as a tool to augment human capacity."
In an interview with The Hollywood Reporter published Thursday, filmmaker Tyler Perry spoke about his concerns related to the impact of AI video synthesis on entertainment industry jobs. In particular, he revealed that he has suspended a planned $800 million expansion of his production studio after seeing what OpenAI's recently announced AI video generator Sora can do.
"I have been watching AI very closely," Perry said in the interview. "I was in the middle of, and have been planning for the last four years... an $800 million expansion at the studio, which would've increased the backlot a tremendous size—we were adding 12 more soundstages. All of that is currently and indefinitely on hold because of Sora and what I'm seeing. I had gotten word over the last year or so that this was coming, but I had no idea until I saw recently the demonstrations of what it's able to do. It's shocking to me."
[...] "It makes me worry so much about all of the people in the business," he told The Hollywood Reporter. "Because as I was looking at it, I immediately started thinking of everyone in the industry who would be affected by this, including actors and grip and electric and transportation and sound and editors, and looking at this, I'm thinking this will touch every corner of our industry."
You can read the full interview at The Hollywood Reporter.
[...] Perry also looks beyond Hollywood and says that it's not just filmmaking that needs to be on alert, and he calls for government action to help retain human employment in the age of AI. "If you look at it across the world, how it's changing so quickly, I'm hoping that there's a whole government approach to help everyone be able to sustain."
Previously on SoylentNews:
OpenAI Teases a New Generative Video Model Called Sora - 20240222
Microsoft's AI text-to-image generator, Copilot Designer, appears to be heavily filtering outputs after a Microsoft engineer, Shane Jones, warned that Microsoft has ignored warnings that the tool randomly creates violent and sexual imagery, CNBC reported.
Jones told CNBC that he repeatedly warned Microsoft of the alarming content he was seeing while volunteering in red-teaming efforts to test the tool's vulnerabilities. Microsoft failed to take the tool down or implement safeguards in response, Jones said, or even post disclosures to change the product's rating to mature in the Android store.
[...] Bloomberg also reviewed Jones' letter and reported that Jones told the FTC that while Copilot Designer is currently marketed as safe for kids, it's randomly generating an "inappropriate, sexually objectified image of a woman in some of the pictures it creates." And it can also be used to generate "harmful content in a variety of other categories, including: political bias, underage drinking and drug use, misuse of corporate trademarks and copyrights, conspiracy theories, and religion to name a few."
[...] Jones' tests also found that Copilot Designer would easily violate copyrights, producing images of Disney characters, including Mickey Mouse or Snow White. Most problematically, Jones could politicize Disney characters with the tool, generating images of Frozen's main character, Elsa, in the Gaza Strip or "wearing the military uniform of the Israel Defense Forces."
Ars was able to generate interpretations of Snow White, but Copilot Designer rejected multiple prompts politicizing Elsa.
If Microsoft has updated the automated content filters, it's likely due to Jones protesting his employer's decisions. [...] Jones has suggested that Microsoft would need to substantially invest in its safety team to put in place the protections he'd like to see. He reported that the Copilot team is already buried by complaints, receiving "more than 1,000 product feedback messages every day." Because of this alleged understaffing, Microsoft is currently only addressing "the most egregious issues," Jones told CNBC.
Related stories on SoylentNews:
Cops Bogged Down by Flood of Fake AI Child Sex Images, Report Says - 20240202
New "Stable Video Diffusion" AI Model Can Animate Any Still Image - 20231130
The Age of Promptography - 20231008
AI-Generated Child Sex Imagery Has Every US Attorney General Calling for Action - 20230908
It Costs Just $400 to Build an AI Disinformation Machine - 20230904
US Judge: Art Created Solely by Artificial Intelligence Cannot be Copyrighted - 20230824
"Meaningful Harm" From AI Necessary Before Regulation, says Microsoft Exec - 20230514 (Microsoft's new quarterly goal?)
The Godfather of AI Leaves Google Amid Ethical Concerns - 20230502
Stable Diffusion Copyright Lawsuits Could be a Legal Earthquake for AI - 20230403
AI Image Generator Midjourney Stops Free Trials but Says Influx of New Users to Blame - 20230331
Microsoft's New AI Can Simulate Anyone's Voice With Three Seconds of Audio - 20230115
Breakthrough AI Technique Enables Real-Time Rendering of Scenes in 3D From 2D Images - 20211214
(Score: 4, Insightful) by inertnet on Monday January 16 2023, @08:25AM (9 children)
I can't think of any useful application, but I can think of many ways to abuse this.
(Score: 5, Insightful) by janrinok on Monday January 16 2023, @08:57AM (5 children)
I'll agree that it is easier to think of abusive applications than useful ones, but there are a few that spring to mind. For example, talking books/audio books, which are popular on smart devices and extremely useful for those with impaired sight, could be produced using well known voices. Perhaps even using an unknown neutral voice, which might even reduce the cost of manufacturing - not that we will see any reduction in price!
Many cartoon-type films - which are currently voiced by well known (and expensive) actors - might also become cheaper to produce, particularly if translated into several languages or more.
However, in the time it has taken me to write this reply I have probably thought of a dozen or more ways in which it could be abused particularly by the criminal fraternity, or by those wishing to influence an individual's political popularity or to change the outcome of elections.
This is why we can't have nice things.....
I am not interested in knowing who people are or where they live. My interest starts and stops at our servers.
(Score: 0) by Anonymous Coward on Monday January 16 2023, @02:46PM (4 children)
"Many cartoon-type films - which are currently voiced by well known (and expensive) actors - might also become cheaper to produce"
translation: Voice actors will be screwed out of a living by something imitating them. It already is an intensely competitive field, and they do maintain rights to the way their voice sounds.
I can't wait for a not-quite John DiMaggio AI...
(Score: 2) by DannyB on Monday January 16 2023, @05:14PM (1 child)
Don't worry. Soon enough they'll learn to create custom voices not made from a human sample. Use existing systems that simulate the human vocal tract. You don't need to come up with an entire voice any longer. Just a few good seconds. Tune for different desirable voices. Build a catalog of a few dozen good quality voices that can be used for legitimate porpoises. Speech to text. Car navigation systems. Audio books.
Oh, here is an application. A boss takes the Word document he typed and uses speech synthesis to prepare a dictation tape. The secretary listens to the tape and types the boss's document nice and neat.
Universal health care is so complex that only 32 of 33 developed nations have found a way to make it work.
(Score: 2) by gtomorrow on Tuesday January 17 2023, @12:53PM
That's ridiculous! Nobody uses tape anymore!
(Score: 0) by Anonymous Coward on Monday January 16 2023, @06:58PM
Hmm, I always thought his name was Joe
(Score: 2) by Mykl on Monday January 16 2023, @10:59PM
It might be considered a trade-off if the movie price were reduced to reflect the cheaper cost to produce the film, but we all know what the chances of that are...
(Score: 2, Informative) by GloomMower on Monday January 16 2023, @01:25PM (1 child)
It would be nice if I can make voice-overs in my own voice without me saying them. Especially to make it sound less monotone and pronounce words correctly without having to do several takes.
My voice:
https://www.youtube.com/watch?v=zndy5BNjf0I [youtube.com]
Later I used AWS Polly text to speech:
https://www.youtube.com/watch?v=K7PMrOzxzj0 [youtube.com]
I thought Polly was better than me reading. But it would be nice if I could make a pristine sample of my voice and use text to speech.
(Score: 2) by inertnet on Monday January 16 2023, @02:16PM
I can see your point, not that your original voice-over is bad though. I usually dislike artificial voice-overs, but that one wasn't so bad.
I can't object as long as it's used voluntarily, but I would really have a problem with people stealing my voice and having me say things that I've never actually spoken.
(Score: 2) by takyon on Monday January 16 2023, @04:45PM
https://screenrant.com/skyrim-bethesda-elder-scrolls-voice-actors-cast-bad/ [screenrant.com]
https://www.escapistmagazine.com/starfield-is-the-one-game-that-should-use-ai-voice-actors/ [escapistmagazine.com]
I believe Skyrim uses 48 Kbps audio for voice acting, and I've seen an estimate of 1215 minutes (20.25 hours) which might be the original game without expansions. That gets you to 437.4 megabytes. The original PC game was around 5.7 GB, small by today's standards.
If you were to instead use text with a markup language, you're probably getting no less than 100:1 compression from text-to-voice (60 bytes for a second, including any markup).
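The arithmetic above can be checked with a quick sketch. It takes the quoted 48 Kbps and 1215 minutes at face value, uses decimal megabytes, and treats the 60 bytes per second of text plus markup as the assumed budget from the comment, not a measured figure.

```python
# Back-of-the-envelope check of the figures quoted above.
BITRATE_KBPS = 48          # quoted voice bitrate, kilobits per second
MINUTES = 1215             # estimated length of the voice acting
TEXT_BYTES_PER_SEC = 60    # assumed text + markup budget per second of speech

seconds = MINUTES * 60                           # total seconds of speech
audio_bytes_per_sec = BITRATE_KBPS * 1000 // 8   # kilobits/s -> bytes/s
audio_mb = audio_bytes_per_sec * seconds / 1e6   # decimal megabytes of audio
text_mb = TEXT_BYTES_PER_SEC * seconds / 1e6     # decimal megabytes of text
ratio = audio_bytes_per_sec / TEXT_BYTES_PER_SEC

print(f"audio: {audio_mb:.1f} MB, text: {text_mb:.2f} MB, ratio: {ratio:.0f}:1")
```

Under those assumptions the audio comes to 437.4 MB and the text to a little over 4 MB, which matches the 100:1 figure.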
If the algorithm can synthesize speech in real time, it can easily be used in video games. Now you're only limited by the number of scripts you can write... and you can use a technology like ChatGPT to write more dialog than a human possibly could. You may even be able to do it dynamically, within the game.
You can also get any player's name inserted into voice lines.
By the way, this markup could become very colloquial and loosely structured like you have seen with AI prompts, as long as the AI can handle it. [beggar voice]Spare [emphasis]just one[/emphasis] coin for an old beggar?[/] or [panting after running for several minutes, hoarse voice]He- he got away![/]
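As a rough illustration of how such loosely structured markup might be split into voice directions and spoken text before being handed to a speech model, here is a minimal sketch. The bracket syntax is taken from the examples above; the `tokenize` and `spoken_text` helpers are hypothetical names, not part of VALL-E or any real text-to-speech API.

```python
import re

# A tag is anything in square brackets; a leading "/" marks a closing tag.
# "[/]" closes whatever is open and is reported with an empty name.
TAG = re.compile(r"\[(/?)([^\[\]]*)\]")

def tokenize(line):
    """Split a line into ("open", name), ("close", name) and ("text", s) tokens."""
    tokens, pos = [], 0
    for m in TAG.finditer(line):
        if m.start() > pos:
            tokens.append(("text", line[pos:m.start()]))
        kind = "close" if m.group(1) else "open"
        tokens.append((kind, m.group(2).strip()))
        pos = m.end()
    if pos < len(line):
        tokens.append(("text", line[pos:]))
    return tokens

def spoken_text(line):
    """Return only the words to be spoken, with all directions removed."""
    return "".join(s for kind, s in tokenize(line) if kind == "text")

line = "[beggar voice]Spare [emphasis]just one[/emphasis] coin for an old beggar?[/]"
for token in tokenize(line):
    print(token)
print(spoken_text(line))
```

The free-form tag names ("beggar voice", "panting after running for several minutes, hoarse voice") are passed through untouched, on the assumption that the model itself would interpret them the way it interprets a prompt.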
These technologies will screw people over, but it's possible that voice actors can still get compensated depending on how scrupulous the companies involved are. For example, voice actors go into a company like Replica Studios and provide voice samples to train AI. Much more than the 3 seconds Microsoft is bragging about here, for better accuracy. The voice actors can come in again in the future to provide different samples, since their voice will change from aging or other factors. Although there will be algorithms to automatically age up voices or add the effects of 7,300 packs of cigarettes to the voice, I'm sure. Voice actors get paid for their time, and get some royalties each time their voice is licensed. I could see game companies licensing out thousands or tens of thousands of voices to provide more variety than what was previously possible. People don't like hearing the same 10-20 voices recycled over and over again for different characters.
Personality rights [wikipedia.org], along with the contract you agree to, are what would protect your "voice". But there is no national personality right in the U.S. And there is nothing stopping a foreign company from ripping off someone's voice work and simply not distributing where they could be sued.
Back on the Elder Scrolls theme, people have not left the 21-year-old Morrowind alone. RTX Remix is being used to add raytracing to the game. Someone is definitely going to take the numerous text conversations and use AI to voice act them.
[SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
(Score: 4, Informative) by Nuke on Monday January 16 2023, @09:53AM (1 child)
"Hi, this is your bank manager, we were speaking in branch last week. You need to transfer $10,000 to a special account we have set up ...."
or :
"Hi, your IT tech support here. We have detected a virus on your Windows. As you can tell by my Oxford accent, I am not from India ..."
(Score: 2) by DannyB on Monday January 16 2023, @05:16PM
Hi,
This is the voice of a rich Nigerian prince who recently died. This voice is from beyond the grave. I spent my entire life trying to give away my substantial fortune by email, but nobody would return my emails!
Universal health care is so complex that only 32 of 33 developed nations have found a way to make it work.
(Score: 1, Insightful) by Anonymous Coward on Monday January 16 2023, @01:03PM (1 child)
Simulate in the sense that it has a whole bunch of pretrained simulated voices and it uses the three seconds to match your voice up to its database. However, the stuff it says after that is based on the pretrained data, not somehow extracted from your short three-second clip.
(Score: 0) by Anonymous Coward on Monday January 16 2023, @03:26PM
Hmm. I wonder...if they don't already have Gilbert Gottfried's annoying voice, is it true that this new thing won't work so well for him?
(Score: 2) by PiMuNu on Monday January 16 2023, @07:10PM
Microsoft's new AI can simulate anyone's voice *badly* with three seconds of audio.
In other news, I can also simulate anyone's voice with three seconds of audio. I also do a great Homer Simpson impression. Doh!
(Score: 0) by Anonymous Coward on Tuesday January 17 2023, @12:42PM
https://arstechnica.com/information-technology/2016/11/adobe-voco-photoshop-for-audio-speech-editing/ [arstechnica.com]
So maybe the progress after 7 years is that Microsoft's method only needs 3 seconds.