
posted on Sunday November 27 2016, @06:59AM
from the one-step-closer-to-HAL-9000 dept.

Arthur T Knackerbracket has found the following story:

Lip-reading is notoriously difficult, depending as much on context and knowledge of language as it does on visual clues. But researchers are showing that machine learning can be used to discern speech from silent video clips more effectively than professional lip-readers can.

In one project, a team from the University of Oxford's Department of Computer Science has developed a new artificial-intelligence system called LipNet. As Quartz reported, its system was built on a data set known as GRID, which is made up of well-lit, face-forward clips of people reading three-second sentences. Each sentence is based on a string of words that follow the same pattern.
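For a sense of how constrained those sentences are, here is a rough sketch (in Python) of a GRID-style sentence generator. The word lists follow the published GRID corpus description and should be treated as illustrative assumptions rather than anything taken from the article:

import random

# Illustrative GRID-style grammar: command color preposition letter digit adverb,
# e.g. "place blue at F nine now". Word lists are assumptions based on the
# published corpus description.
COMMANDS     = ["bin", "lay", "place", "set"]
COLORS       = ["blue", "green", "red", "white"]
PREPOSITIONS = ["at", "by", "in", "with"]
LETTERS      = list("ABCDEFGHIJKLMNOPQRSTUVXYZ")   # the corpus omits "W"
DIGITS       = ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"]
ADVERBS      = ["again", "now", "please", "soon"]

def random_grid_sentence():
    """Return one sentence following the fixed GRID word pattern."""
    return " ".join(random.choice(words) for words in
                    (COMMANDS, COLORS, PREPOSITIONS, LETTERS, DIGITS, ADVERBS))

print(random_grid_sentence())   # e.g. "set red by G four please"

Every clip pairs a short video with one such fixed-pattern sentence, which is part of what makes the task tractable for a neural network.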

The team used that data set to train a neural network, similar to the kind often used to perform speech recognition. In this case, though, the neural network identifies variations in mouth shape over time, learning to link that information to an explanation of what's being said. The AI doesn't analyze the footage in snatches but considers the whole thing, enabling it to gain an understanding of context from the sentence being analyzed. That's important, because there are fewer mouth shapes than there are sounds produced by the human voice.
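As a concrete illustration of that kind of pipeline, here is a minimal sketch assuming a PyTorch implementation: spatiotemporal (3D) convolutions pick up how mouth shape changes across frames, a recurrent layer runs over the whole clip so context flows in both directions, and the output is shaped for a CTC-style loss so no frame-by-frame alignment is needed. Layer sizes and names are invented for illustration and are not taken from the LipNet paper.

import torch
import torch.nn as nn

class LipReader(nn.Module):
    """Sketch of a sentence-level lip-reading model (hypothetical sizes)."""

    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        # 3D convolutions over (frames, height, width) capture mouth motion.
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),   # pool space, keep the time axis
        )
        # Bidirectional GRU integrates context across the whole clip.
        self.gru = nn.GRU(64 * 4 * 4, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, vocab_size + 1)   # +1 for the CTC blank symbol

    def forward(self, video):
        # video: (batch, channels, frames, height, width)
        feats = self.conv(video)                            # (B, 64, T, 4, 4)
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.gru(feats)                            # whole-clip context
        return self.fc(out).log_softmax(dim=-1)             # per-frame log-probs

Training would pair these per-frame log-probabilities with torch.nn.CTCLoss against the sentence transcripts, which sidesteps the need to align each video frame with a sound by hand.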

When tested, the system was able to identify 93.4 percent of words correctly. Human lip-reading volunteers asked to perform the same tasks identified just 52.3 percent of words correctly.


Original Submission

 
  • (Score: 2) by TGV (2838) on Sunday November 27 2016, @08:16AM (#433584)

    Pretty impressive. Humans are bad at it, we know that. However, the human baseline consists of people who can probably apply this skill in the wild, while the program won't fare as well outside the format of the GRID data set. But the most important thing for me is that we now know there is considerably more speech-related information in our facial movements than we actually pick up.

  • (Score: 0) by Anonymous Coward on Sunday November 27 2016, @12:37PM (#433612)

    the human base line consists of people that can probably apply this skill in the wild, while the program won't fare as well outside the format of the GRID

    So you recommend that people start living off the GRID?

  • (Score: 2) by tonyPick (1237) on Sunday November 27 2016, @12:49PM (#433615)

    However, the human base line consists of people that can probably apply this skill in the wild, while the program won't fare as well outside the format of the GRID data set.

    Yeah - I suspect the big stumbling block here, as in speech recognition, will be that while people perform relatively poorly at recognising specific words, the massive win in comprehension comes from being able to contextualise the data and decipher the speaker's meaning and intent more accurately.

    As it turns out, you don't have to understand every word of speech, and I assume similar rules apply to lip reading: if you have some cues and can pick up on what the subject is trying to say, then you can get away with relatively low accuracy on specific words.