SoylentNews is people

posted by Fnord666 on Friday December 29 2017, @06:35AM
from the this-will-be-the-voice-of-skynet dept.

A research paper published by Google this month—which has not been peer reviewed—details a text-to-speech system called Tacotron 2, which claims near-human accuracy at imitating audio of a person speaking from text.

The system is Google's second official generation of the technology, which consists of two deep neural networks. The first network translates the text into a spectrogram (pdf), a visual way to represent audio frequencies over time. That spectrogram is then fed into WaveNet, a system from Alphabet's AI research lab DeepMind, which reads the chart and generates the corresponding audio elements accordingly.
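To make the "spectrogram" step concrete, here is a minimal sketch of how a plain magnitude spectrogram can be computed from audio samples: slice the signal into overlapping frames, window each frame, and keep the magnitude of each frequency bin. This is only an illustration under assumptions, not the representation or code used in the paper; the function names, frame length, hop size, and toy sine-wave input are all made up for the example.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Return |X[k]| for k = 0..N/2 of one real-valued frame (plain O(N^2) DFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags


def spectrogram(signal, frame_len=64, hop=32):
    """Cut the signal into overlapping Hann-windowed frames, one magnitude spectrum each."""
    # Hypothetical parameters: frame_len and hop are chosen only to keep the toy example small.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / frame_len) for t in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(frame_len)]
        frames.append(dft_magnitudes(frame))
    return frames  # frames[i][k]: magnitude at frequency bin k, time step i


# Toy input (an assumption, not speech): a pure tone with 8 cycles per 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
spec = spectrogram(tone)
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
print(len(spec), "frames; loudest bin per frame:", peak_bins)
```

Each row of `spec` is one time step and each column one frequency bin, which is the "chart" the summary describes. For the toy tone, every frame peaks at bin 8.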

[...] The Google researchers also demonstrate that Tacotron 2 can handle hard-to-pronounce words and names, as well as alter the way it enunciates based on punctuation. For instance, capitalized words are stressed, as someone would do when indicating that specific word is an important part of a sentence.

[...] Unlike some core AI research the company does, this technology is immediately useful to Google. WaveNet, first announced in 2016, is now used to generate the voice in Google Assistant. Once readied for production, Tacotron 2 could be an even more powerful addition to the service.

However, the system is only trained to mimic the one female voice; to speak like a male or different female, Google would need to train the system again.



This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 1, Funny) by Anonymous Coward on Friday December 29 2017, @06:54AM

    by Anonymous Coward on Friday December 29 2017, @06:54AM (#615474)

    Your phone has virus making your battery slow. Please to be signing up for more targeted advertising to win a free battery replacement.

  • (Score: -1, Troll) by Anonymous Coward on Friday December 29 2017, @07:07AM (2 children)

    by Anonymous Coward on Friday December 29 2017, @07:07AM (#615480)

    Female voices make me want to kill every woman.

    • (Score: 0) by Anonymous Coward on Friday December 29 2017, @07:52AM

      by Anonymous Coward on Friday December 29 2017, @07:52AM (#615486)

      Red Pillar detected! Penisectomy indicated. Proceed?

    • (Score: 2) by LoRdTAW on Friday December 29 2017, @02:06PM

      by LoRdTAW (3755) on Friday December 29 2017, @02:06PM (#615512) Journal
  • (Score: 2) by MostCynical on Friday December 29 2017, @09:35AM (1 child)

    by MostCynical (2589) on Friday December 29 2017, @09:35AM (#615490) Journal

    have they tested how *annoying* the voice is?

    --
    "I guess once you start doubting, there's no end to it." -Batou, Ghost in the Shell: Stand Alone Complex
  • (Score: 1, Insightful) by Anonymous Coward on Friday December 29 2017, @10:33AM (3 children)

    by Anonymous Coward on Friday December 29 2017, @10:33AM (#615494)

    Once you have the spectrogram, output is defined. What is wavenet really doing?

    • (Score: 2) by crafoo on Friday December 29 2017, @03:36PM (2 children)

      by crafoo (6639) on Friday December 29 2017, @03:36PM (#615535)

      Vocalizing the audio in any voice you would like. I don't think the spectrogram is 100% actual voice data ready to send to an output audio buffer.

      • (Score: 0) by Anonymous Coward on Saturday December 30 2017, @05:20AM (1 child)

        by Anonymous Coward on Saturday December 30 2017, @05:20AM (#615751)

        A spectrogram is precisely that: audio data. Period.

        It's a lot more likely that what they have is a spectrographic skeleton for the voice, on which they still need to run transforms to actually have audio data.

        In other words, two layers: first turn morphemes into a skeleton, then turn that skeleton into an actual spectrogram with a translation function (basically, the next neural net).

        • (Score: 2) by darkfeline on Thursday January 04 2018, @05:09AM

          by darkfeline (1030) on Thursday January 04 2018, @05:09AM (#617523) Homepage

          A spectrogram is not a waveform. I suspect that if you try to do that conversion naively, what you get does not sound natural at all. Most human voices do not include white noise.

          --
          Join the SDF Public Access UNIX System today!
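One way to see why a spectrogram is not yet a waveform: a magnitude spectrogram keeps how strong each frequency is but drops the phase, so more than one waveform matches the same magnitudes. The sketch below is a toy illustration of that single point, assuming a plain DFT on one short frame. It says nothing about how WaveNet itself works, and the test signal and helper names are made up for the example.

```python
import cmath
import math


def dft(x):
    """Plain O(N^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning only the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


# One toy frame (an assumption, not speech): two sinusoids with different phase offsets.
n = 32
original = [math.sin(2 * math.pi * 3 * t / n + 0.7)
            + 0.5 * math.sin(2 * math.pi * 5 * t / n + 2.0)
            for t in range(n)]

magnitudes = [abs(c) for c in dft(original)]  # what a magnitude spectrogram column keeps
naive = idft(magnitudes)                      # naive inversion: pretend every phase is zero

# The naive result has the same magnitudes but is a different waveform.
naive_magnitudes = [abs(c) for c in dft(naive)]
max_mag_diff = max(abs(a - b) for a, b in zip(magnitudes, naive_magnitudes))
max_wave_diff = max(abs(a - b) for a, b in zip(original, naive))
print("largest magnitude mismatch:", max_mag_diff)
print("largest waveform mismatch: ", max_wave_diff)
```

The magnitude mismatch is down at floating-point noise while the waveform mismatch is large, so something has to supply the missing phase before the result is playable audio.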
  • (Score: 0) by Anonymous Coward on Friday December 29 2017, @02:16PM (2 children)

    by Anonymous Coward on Friday December 29 2017, @02:16PM (#615518)

    But still cannot pronounce my name right. Some great fake AI there.

    • (Score: 0) by Anonymous Coward on Friday December 29 2017, @02:26PM (1 child)

      by Anonymous Coward on Friday December 29 2017, @02:26PM (#615522)

      AC is a very common name (here on SN), I suspect "she" can pronounce it perfectly.

      • (Score: 0) by Anonymous Coward on Saturday December 30 2017, @06:44PM

        by Anonymous Coward on Saturday December 30 2017, @06:44PM (#615913)

        You forget that the spelling of a name is not the same as saying it. I can spell my name as Brown and say it as Smith. But that is just a simple example. My last name has 3 pronunciations.

  • (Score: 2) by crafoo on Friday December 29 2017, @03:38PM (1 child)

    by crafoo (6639) on Friday December 29 2017, @03:38PM (#615536)

    Will they publish their NN datasets? Are they using TensorFlow or some modified version? How soon until I can buy a box of "voice chips"?

    • (Score: 2) by fyngyrz on Friday December 29 2017, @10:59PM

      by fyngyrz (6567) on Friday December 29 2017, @10:59PM (#615684) Journal

      How soon until I can buy a box of "voice chips"?

      Here you go. [ebay.com]

      No need to thank me, I was here anyway. :)

  • (Score: 2) by donkeyhotay on Friday December 29 2017, @04:39PM (2 children)

    by donkeyhotay (2540) on Friday December 29 2017, @04:39PM (#615556)

    I'll grant that the results are pretty good; however, I had no trouble distinguishing the humans from the AI voices, even though in each case there was just one sentence. I suspect that with an entire paragraph of speech it would become even more obvious.

    • (Score: 3, Informative) by takyon on Friday December 29 2017, @05:23PM (1 child)

      by takyon (881) <takyonNO@SPAMsoylentnews.org> on Friday December 29 2017, @05:23PM (#615569) Journal

      It's pretty damn good compared to voice assistants that are in use or Daniel (UK) or whatever.

      Let's hope this can be used with stuff like Mycroft [mycroft.ai], Jasper [github.io], or Lucida [lucida.ai].

      --
      [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
      • (Score: 0) by Anonymous Coward on Saturday December 30 2017, @05:22AM

        by Anonymous Coward on Saturday December 30 2017, @05:22AM (#615753)

        Nah, this will go for ELIZA.

        Or, for a few people hanging around here, DOCTOR.
