posted by Fnord666 on Monday January 16 2023, @07:56AM   Printer-friendly
from the my-voice-is-no-longer-my-password dept.

Text-to-speech model can preserve speaker's emotional tone and acoustic environment:

On Thursday, Microsoft researchers announced a new text-to-speech AI model called VALL-E that can closely simulate a person's voice when given a three-second audio sample. Once it learns a specific voice, VALL-E can synthesize audio of that person saying anything—and do it in a way that attempts to preserve the speaker's emotional tone.

Its creators speculate that VALL-E could be used for high-quality text-to-speech applications, speech editing where a recording of a person could be edited and changed from a text transcript (making them say something they originally didn't), and audio content creation when combined with other generative AI models like GPT-3.


This discussion was created by Fnord666 (652) for logged-in users only, but now has been archived. No new comments can be posted.
  • (Score: 4, Insightful) by inertnet on Monday January 16 2023, @08:25AM (9 children)

    by inertnet (4071) on Monday January 16 2023, @08:25AM (#1287042) Journal

    I can't think of any useful application, but I can think of many ways to abuse this.

    • (Score: 5, Insightful) by janrinok on Monday January 16 2023, @08:57AM (5 children)

      by janrinok (52) Subscriber Badge on Monday January 16 2023, @08:57AM (#1287043) Journal

      I'll agree that it is easier to think of abusive applications than useful ones, but there are a few that spring to mind. For example, talking books/audio books, which are popular on smart devices and extremely useful for those with impaired sight, could be produced using well-known voices. Or perhaps using an unknown neutral voice, which might even reduce the cost of manufacturing - not that we will see any reduction in price!

      Many cartoon-type films - which are currently voiced by well known (and expensive) actors - might also become cheaper to produce, particularly if translated into several languages.

      However, in the time it has taken me to write this reply I have probably thought of a dozen or more ways in which it could be abused, particularly by the criminal fraternity, or by those wishing to influence an individual's political popularity or to change the outcome of elections.

      This is why we can't have nice things.....

      --
      I am not interested in knowing who people are or where they live. My interest starts and stops at our servers.
      • (Score: 0) by Anonymous Coward on Monday January 16 2023, @02:46PM (4 children)

        by Anonymous Coward on Monday January 16 2023, @02:46PM (#1287056)

        "Many cartoon-type films - which are currently voiced by well known (and expensive) actors - might also become cheaper to produce"

        Translation: voice actors will be screwed out of a living by something imitating them. It is already an intensely competitive field, and they do maintain rights to the way their voice sounds.

        I can't wait for a not-quite John DiMaggio AI...

        • (Score: 2) by DannyB on Monday January 16 2023, @05:14PM (1 child)

          by DannyB (5839) Subscriber Badge on Monday January 16 2023, @05:14PM (#1287085) Journal

          Don't worry. Soon enough they'll learn to create custom voices not made from a human sample. Use existing systems that simulate the human vocal tract. You don't need to come up with an entire voice any longer. Just a few good seconds. Tune for different desirable voices. Build a catalog of a few dozen good quality voices that can be used for legitimate porpoises. Speech to text. Car navigation systems. Audio books.

          Oh, here is an application. A boss takes the Word document he typed and uses speech synthesis to prepare a dictation tape. The secretary listens to the tape and types the boss's document nice and neat.

          --
          The Centauri traded Earth jump gate technology in exchange for our superior hair mousse formulas.
          • (Score: 2) by gtomorrow on Tuesday January 17 2023, @12:53PM

            by gtomorrow (2230) on Tuesday January 17 2023, @12:53PM (#1287211)

            Oh, here is an application. A boss takes the Word document he typed and uses speech synthesis to prepare a dictation tape. The secretary listens to the tape and types the boss's document nice and neat.

            That's ridiculous! Nobody uses tape anymore!

        • (Score: 0) by Anonymous Coward on Monday January 16 2023, @06:58PM

          by Anonymous Coward on Monday January 16 2023, @06:58PM (#1287106)

          Hmm, I always thought his name was Joe

        • (Score: 2) by Mykl on Monday January 16 2023, @10:59PM

          by Mykl (1112) on Monday January 16 2023, @10:59PM (#1287157)

          It might be considered a trade-off if the movie price were reduced to reflect the cheaper cost to produce the film, but we all know what the chances of that are...

    • (Score: 2, Informative) by GloomMower on Monday January 16 2023, @01:25PM (1 child)

      by GloomMower (17961) on Monday January 16 2023, @01:25PM (#1287054)

      It would be nice if I can make voice-overs in my own voice without me saying them. Especially to make it sound less monotone and pronounce words correctly without having to do several takes.

      My voice:
      https://www.youtube.com/watch?v=zndy5BNjf0I [youtube.com]

      Later I used AWS Polly text to speech:
      https://www.youtube.com/watch?v=K7PMrOzxzj0 [youtube.com]

      I thought polly was better than me reading. But it would be nice if I could make a pristine sample of my voice and use text to speech.

      • (Score: 2) by inertnet on Monday January 16 2023, @02:16PM

        by inertnet (4071) on Monday January 16 2023, @02:16PM (#1287055) Journal

        I can see your point, not that your original voice-over is bad though. I usually dislike artificial voice-overs, but that one wasn't so bad.

        I can't object as long as it's used voluntarily, but I would really have a problem with people stealing my voice and having me say things that I've never actually spoken.

    • (Score: 2) by takyon on Monday January 16 2023, @04:45PM

      by takyon (881) <{takyon} {at} {soylentnews.org}> on Monday January 16 2023, @04:45PM (#1287077) Journal

      https://screenrant.com/skyrim-bethesda-elder-scrolls-voice-actors-cast-bad/ [screenrant.com]
      https://www.escapistmagazine.com/starfield-is-the-one-game-that-should-use-ai-voice-actors/ [escapistmagazine.com]

      I believe Skyrim uses 48 Kbps audio for voice acting, and I've seen an estimate of 1215 minutes (20.25 hours) which might be the original game without expansions. That gets you to 437.4 megabytes. The original PC game was around 5.7 GB, small by today's standards.

      If you were to instead use text with a markup language, you're probably getting no less than 100:1 compression from text-to-voice (60 bytes of text per second of speech, including any markup, versus 6,000 bytes of audio per second at 48 Kbps).
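The arithmetic above can be checked with a short sketch. It only reuses the figures quoted in this comment (48 Kbps, 1215 minutes, 60 bytes of text per second of speech); the constant and function names are made up for illustration, and none of the numbers are measured from the actual game files.

```python
# Figures quoted in the comment; they are estimates, not measurements.
BITRATE_BITS_PER_SEC = 48_000   # 48 Kbps voice audio
DURATION_MINUTES = 1215         # estimated amount of voiced dialogue
TEXT_BYTES_PER_SEC = 60         # assumed text plus markup per spoken second


def audio_bytes(bitrate_bits_per_sec, minutes):
    """Size in bytes of audio at a constant bitrate."""
    return bitrate_bits_per_sec * minutes * 60 // 8


def compression_ratio(bitrate_bits_per_sec, text_bytes_per_sec):
    """Ratio of audio bytes per second to text bytes per second."""
    return (bitrate_bits_per_sec / 8) / text_bytes_per_sec


audio_size = audio_bytes(BITRATE_BITS_PER_SEC, DURATION_MINUTES)
text_size = TEXT_BYTES_PER_SEC * DURATION_MINUTES * 60
ratio = compression_ratio(BITRATE_BITS_PER_SEC, TEXT_BYTES_PER_SEC)

print(f"audio: {audio_size / 1e6:.1f} MB")
print(f"text:  {text_size / 1e6:.2f} MB")
print(f"ratio: {ratio:.0f}:1")
```

Under these assumptions the whole script would fit in a few megabytes of text, which is where the 100:1 figure comes from.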

      If the algorithm can synthesize speech in real time, it can easily be used in video games. Now you're only limited by the number of scripts you can write... and you can use a technology like ChatGPT to write more dialog than a human possibly could. You may even be able to do it dynamically, within the game.

      You can also get any player's name inserted into voice lines.

      By the way, this markup could become very colloquial and loosely structured like you have seen with AI prompts, as long as the AI can handle it. [beggar voice]Spare [emphasis]just one[/emphasis] coin for an old beggar?[/] or [panting after running for several minutes, hoarse voice]He- he got away![/]
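As a rough illustration of how a game might read such markup before handing it to a speech model, here is a minimal sketch. The bracket syntax is taken from the two examples above. Everything else is an assumption: the function name `parse_voice_markup` is hypothetical, and the rule that `[/]` or `[/name]` simply closes the most recently opened direction is a guess at the intended meaning. No real text-to-speech API is called.

```python
import re

# Matches an opening tag like [beggar voice] or a closing tag like [/] or [/emphasis].
TAG = re.compile(r"\[(/?)([^\]]*)\]")


def parse_voice_markup(line):
    """Split a marked-up line into (text, active_directions) segments."""
    segments, stack, pos = [], [], 0
    for match in TAG.finditer(line):
        if match.start() > pos:
            # Plain text before this tag inherits every open direction.
            segments.append((line[pos:match.start()], tuple(stack)))
        if match.group(1):
            # Closing tag: drop the most recently opened direction, if any.
            if stack:
                stack.pop()
        else:
            # Opening tag: keep the free-form direction text as one string.
            stack.append(match.group(2).strip())
        pos = match.end()
    if pos < len(line):
        segments.append((line[pos:], tuple(stack)))
    return segments


example = "[beggar voice]Spare [emphasis]just one[/emphasis] coin for an old beggar?[/]"
for text, directions in parse_voice_markup(example):
    print(repr(text), directions)
```

Each segment carries the free-form directions that apply to it, so a loosely worded direction like "panting after running for several minutes, hoarse voice" is passed through unchanged for the model to interpret.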

      These technologies will screw people over, but it's possible that voice actors can still get compensated depending on how scrupulous the companies involved are. For example, voice actors go into a company like Replica Studios and provide voice samples to train AI. Much more than the 3 seconds Microsoft is bragging about here, for better accuracy. The voice actors can come in again in the future to provide different samples, since their voice will change from aging or other factors. Although there will be algorithms to automatically age up voices or add the effects of 7,300 packs of cigarettes to the voice, I'm sure. Voice actors get paid for their time, and get some royalties each time their voice is licensed. I could see game companies licensing out thousands or tens of thousands of voices to provide more variety than what was previously possible. People don't like hearing the same 10-20 voices recycled over and over again for different characters.

      Personality rights [wikipedia.org], along with the contract you agree to, are what would protect your "voice". But there is no national personality right in the U.S. And there is nothing stopping a foreign company from ripping off someone's voice work and simply not distributing where they could be sued.

      Back on the Elder Scrolls theme, people have not left the 21-year-old Morrowind alone. RTX Remix is being used to add raytracing to the game. Someone is definitely going to take the numerous text conversations and use AI to voice act them.

      --
      [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
  • (Score: 4, Informative) by Nuke on Monday January 16 2023, @09:53AM (1 child)

    by Nuke (3162) on Monday January 16 2023, @09:53AM (#1287044)

    "Hi, this is your bank manager, we were speaking in branch last week. You need to transfer $10,000 to a special account we have set up ...."
    or :
    "Hi, your IT tech support here. We have detected a virus on your Windows. As you can tell by my Oxford accent, I am not from India ..."

    • (Score: 2) by DannyB on Monday January 16 2023, @05:16PM

      by DannyB (5839) Subscriber Badge on Monday January 16 2023, @05:16PM (#1287087) Journal

      Hi,
      This is the voice of a rich Nigerian prince who recently died. This voice is from beyond the grave. I spent my entire life trying to give away my substantial fortune by email, but nobody would return my emails!

      --
      The Centauri traded Earth jump gate technology in exchange for our superior hair mousse formulas.
  • (Score: 1, Insightful) by Anonymous Coward on Monday January 16 2023, @01:03PM (1 child)

    by Anonymous Coward on Monday January 16 2023, @01:03PM (#1287053)

    Microsoft trained VALL-E's speech-synthesis capabilities on an audio library, assembled by Meta, called LibriLight. It contains 60,000 hours of English language speech from more than 7,000 speakers, mostly pulled from LibriVox public domain audiobooks. For VALL-E to generate a good result, the voice in the three-second sample must closely match a voice in the training data.

    "Simulate" in the sense that it has a whole bunch of pretrained simulated voices and it uses the three seconds to match your voice up to its database. However, the stuff it says after that is based on the pretrained data, not somehow extracted from your short three-second clip.

    • (Score: 0) by Anonymous Coward on Monday January 16 2023, @03:26PM

      by Anonymous Coward on Monday January 16 2023, @03:26PM (#1287063)

      Hmm. I wonder...if they don't already have Gilbert Gottfried's annoying voice, is it true that this new thing won't work so well for him?

  • (Score: 2) by PiMuNu on Monday January 16 2023, @07:10PM

    by PiMuNu (3823) on Monday January 16 2023, @07:10PM (#1287109)

    Microsoft's new AI can simulate anyone's voice *badly* with three seconds of audio.

    In other news, I can also simulate anyone's voice with three seconds of audio. I also do a great Homer Simpson impression. Doh!

  • (Score: 0) by Anonymous Coward on Tuesday January 17 2023, @12:42PM

    by Anonymous Coward on Tuesday January 17 2023, @12:42PM (#1287209)

    https://arstechnica.com/information-technology/2016/11/adobe-voco-photoshop-for-audio-speech-editing/ [arstechnica.com]

    So maybe the progress after 7 years is that Microsoft's method only needs 3 seconds.
