Stories
Slash Boxes
Comments

SoylentNews is people

posted by Fnord666 on Monday January 16 2023, @07:56AM   Printer-friendly
from the my-voice-is-no-longer-my-password dept.

Text-to-speech model can preserve speaker's emotional tone and acoustic environment:

On Thursday, Microsoft researchers announced a new text-to-speech AI model called VALL-E that can closely simulate a person's voice when given a three-second audio sample. Once it learns a specific voice, VALL-E can synthesize audio of that person saying anything—and do it in a way that attempts to preserve the speaker's emotional tone.

Its creators speculate that VALL-E could be used for high-quality text-to-speech applications, speech editing where a recording of a person could be edited and changed from a text transcript (making them say something they originally didn't), and audio content creation when combined with other generative AI models like GPT-3.


Original Submission

 
This discussion was created by Fnord666 (652) for logged-in users only, but now has been archived. No new comments can be posted.
Display Options Threshold/Breakthrough Mark All as Read Mark All as Unread
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 1, Insightful) by Anonymous Coward on Monday January 16 2023, @01:03PM (1 child)

    by Anonymous Coward on Monday January 16 2023, @01:03PM (#1287053)

    Microsoft trained VALL-E's speech-synthesis capabilities on an audio library, assembled by Meta, called LibriLight. It contains 60,000 hours of English language speech from more than 7,000 speakers, mostly pulled from LibriVox public domain audiobooks. For VALL-E to generate a good result, the voice in the three-second sample must closely match a voice in the training data.

    Simulate in the sense that it has a whole bunch of pretrained simulated voices and it uses the three seconds to match your voice up to its database. However, the stuff it says after that is based on the pretrained data, not somehow extracted from your short three second clip.

    Starting Score:    0  points
    Moderation   +1  
       Insightful=1, Total=1
    Extra 'Insightful' Modifier   0  

    Total Score:   1  
  • (Score: 0) by Anonymous Coward on Monday January 16 2023, @03:26PM

    by Anonymous Coward on Monday January 16 2023, @03:26PM (#1287063)

    Hmm. I wonder...if they don't already have Gilbert Gottfried's annoying voice, is it true that this new thing won't work so well for him?