
SoylentNews is people

posted by chromas on Tuesday September 25 2018, @07:43PM
from the musical-chairs dept.

In this article the authors introduce . . .

PixelPlayer, a system that, by watching large amounts of unlabeled videos, learns to locate image regions which produce sounds and separate the input sounds into a set of components that represents the sound from each pixel. Our approach capitalizes on the natural synchronization of the visual and audio modalities to learn models that jointly parse sounds and images, without requiring additional manual supervision.

The system is trained with a large number of videos containing people playing instruments in different combinations, including solos and duets. No supervision is provided on what instruments are present on each video, where they are located, or how they sound. During test time, the input to the system is a video showing people playing different instruments, and the mono auditory input. Our system performs audio-visual source separation and localization, splitting the input sound signal into N sound channels, each one corresponding to a different instrument category. In addition, the system can localize the sounds and assign a different audio wave to each pixel in the input video.

A video is included along with an explanation of several interesting demos, such as pointing at any pixel to hear the sound from that pixel, or remixing the volume levels of different musical instruments in the video.

The paper is included along with the data set. It says the code is coming soon.
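
As a rough illustration of the "splitting the input sound signal into N sound channels" step quoted above, here is a toy sketch of mask-based separation in Python. It is not the authors' code or network: the masks below are built by hand from a known frequency split, whereas PixelPlayer is described as learning its separation from video. The function name `separate_by_masks`, the two test tones, and the 1000 Hz cutoff are all assumptions made up for this example.

```python
import numpy as np

def separate_by_masks(mixture, masks):
    """Split a mono signal into len(masks) channels using frequency-domain masks.

    Each mask has one weight per rfft bin; a weight of 1 keeps that bin
    for the channel and a weight of 0 drops it.
    """
    spectrum = np.fft.rfft(mixture)
    return [np.fft.irfft(spectrum * mask, n=len(mixture)) for mask in masks]

# Toy "duet": two pure tones standing in for two instruments (hypothetical data).
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate      # one second of samples
low = np.sin(2 * np.pi * 220 * t)             # lower-pitched source
high = np.sin(2 * np.pi * 1760 * t)           # higher-pitched source
mixture = low + high                          # the mono input

# Hand-built binary masks: everything below 1000 Hz goes to channel 0,
# everything else to channel 1. A learned system would predict these instead.
freqs = np.fft.rfftfreq(len(mixture), d=1 / sample_rate)
low_mask = (freqs < 1000).astype(float)
high_mask = 1.0 - low_mask

low_est, high_est = separate_by_masks(mixture, [low_mask, high_mask])
```

Because the two masks sum to one in every bin, the output channels add back up to the original mixture. Real instruments overlap in frequency, which is why a fixed split like this one is only a demonstration and not a practical separator.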



This discussion has been archived. No new comments can be posted.
  • (Score: 3, Interesting) by FatPhil on Tuesday September 25 2018, @08:44PM (1 child)

    by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Tuesday September 25 2018, @08:44PM (#739860) Homepage
    It seems it only takes a flat 2D video as input, and mono sound, and then splits on frequencies. Wouldn't it be better to have 2 or 3 cameras, and 2 or 3 microphones, then build a 3D spatial image of the places of movement correlated to the sounds that change there, using techniques akin to interferometry? Then you could select the sounds, at all frequencies, that appear to be sourced from near a point in 3D space, and isolate/remove/whatever them.
    --
    Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
    • (Score: 2) by BsAtHome on Tuesday September 25 2018, @09:07PM

      by BsAtHome (889) on Tuesday September 25 2018, @09:07PM (#739875)

      That would be technically sound. However, the currently heard hype is neural networks. No money for simple engineering of waves. This is the buzz of played words that has to be split into a network of big computers using more power achieving less. That is where we hear the bells ring.

      Now, picture that!

  • (Score: 4, Informative) by ledow on Tuesday September 25 2018, @09:20PM (1 child)

    by ledow (5567) on Tuesday September 25 2018, @09:20PM (#739878) Homepage

    It could have implications for post-editing, for removing the background from live performances, even for synchronising audio and video visually.

    If it wasn't actually quite bad at doing what it does.

  • (Score: 0, Informative) by Anonymous Coward on Tuesday September 25 2018, @10:37PM (1 child)

    by Anonymous Coward on Tuesday September 25 2018, @10:37PM (#739908)

    AI is really good at that, I hear. Maybe they can tell it to listen for pixels that say "Allah Akbar Jihad Jihad".

    • (Score: 2) by DannyB on Wednesday September 26 2018, @01:59PM

      by DannyB (5839) Subscriber Badge on Wednesday September 26 2018, @01:59PM (#740179) Journal

      For your purposes, you could train an AI to recognize terrorists by their skin color.

      --
      Infinity is clearly an even number since the next higher number is odd.
  • (Score: 2) by crafoo on Tuesday September 25 2018, @10:45PM

    by crafoo (6639) on Tuesday September 25 2018, @10:45PM (#739910)

    So is this just bandpass filters and basic image recognition? huh

  • (Score: 0) by Anonymous Coward on Tuesday September 25 2018, @11:03PM

    by Anonymous Coward on Tuesday September 25 2018, @11:03PM (#739922)

    So can it generate the video that would produce a sound track? À la Winamp visualizations?
    Or maybe generate the audio that corresponds to a video?

  • (Score: 0) by Anonymous Coward on Tuesday September 25 2018, @11:06PM

    by Anonymous Coward on Tuesday September 25 2018, @11:06PM (#739925)
  • (Score: 4, Insightful) by fishybell on Tuesday September 25 2018, @11:53PM (2 children)

    by fishybell (3156) on Tuesday September 25 2018, @11:53PM (#739944)

    I feel like what was missing (didn't RTFA, but did WTFV) is picking out the sound of a person talking and isolating that. Even if all it did was drown out background noise, we'd have a winner. Add in existing filtering technologies, and we're talking not just about a potentially profitable product, but also a scary big brother tech that could isolate sounds in a crowd.

    • (Score: 2) by mhajicek on Wednesday September 26 2018, @12:31AM (1 child)

      by mhajicek (51) on Wednesday September 26 2018, @12:31AM (#739957)

      If I could take a YouTube video and remove the annoying music playing too loudly over the person talking that would be great. Or take a TV show or movie and cut the volume of the music and sound effects, and boost the dialogue so I can actually understand what's being said.

      --
      The spacelike surfaces of time foliations can have a cusp at the surface of discontinuity. - P. Hajicek
      • (Score: 1, Interesting) by Anonymous Coward on Wednesday September 26 2018, @03:33PM

        by Anonymous Coward on Wednesday September 26 2018, @03:33PM (#740249)

        Sometimes also the opposite applies: Music played in the background of some talking, and you like the music, and would like to remove the talking without affecting the quality of the music, so you can enjoy the music in isolation.
