from the puttin'-on-the-squeeze dept.
Using AI to compress audio files for quick and easy sharing
Today, we are detailing progress that our Fundamental AI Research (FAIR) team has made in the area of AI-powered hypercompression of audio. Imagine listening to a friend's audio message in an area with low connectivity and not having it stall or glitch. Our research shows how we can use AI to help us achieve this. We built a three-part system and trained it end to end to compress audio data to the size we target. This data can then be decoded using a neural network. We achieve an approximate 10x compression rate compared with MP3 at 64 kbps, without a loss of quality. While such techniques have been explored before for speech, we are the first to make it work for 48 kHz sampled stereo audio (i.e., CD quality), which is the standard for music distribution. We are sharing additional details in a research paper, along with code and samples as part of our commitment to open science.
The new approach can compress and decompress audio in real time to state-of-the-art size reductions. More work needs to be done, but eventually it could lead to improvements such as supporting faster, better-quality calls under poor network conditions and delivering rich metaverse experiences without requiring major bandwidth improvements.
GitHub. Also at Ars Technica.
High Fidelity Neural Audio Compression (arXiv:2210.13438)
(Score: 3, Informative) by FatPhil on Friday November 04, @02:02PM (2 children)
"We achieve an approximate 10x compression rate compared with MP3 at 64 kbps" ... why are you comparing against a full bandwidth encoder, and not even a state-of-the-art one?
Why not compare your 6.4kbps codec with Codec 2? Oh, because "Codec 2 consists of 3200, 2400, 1600, 1400, 1300, 1200, 700 and 450 bit/s codec modes. It outperforms most other low-bitrate speech codecs." -- https://en.wikipedia.org/wiki/Codec2 (And of course there's Opus that operates at low bitrates for low bandwidth signals like speech.)
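For scale, the arithmetic behind that comparison can be sketched as below. This is a back-of-the-envelope illustration, not anything taken from the paper: the 6.4 kbps figure is simply 64 kbps divided by the claimed 10x, the Codec 2 modes are the ones in the Wikipedia quote, and `bytes_per_minute` is a made-up helper name.

```python
# Back-of-the-envelope bitrate comparison (illustrative only).
MP3_BPS = 64_000                       # MP3 baseline quoted in the summary
CLAIMED_RATIO = 10                     # "approximate 10x" from the summary
neural_bps = MP3_BPS // CLAIMED_RATIO  # 6400 bit/s implied by the claim

# Codec 2 modes as listed in the Wikipedia quote (bit/s).
CODEC2_MODES = [3200, 2400, 1600, 1400, 1300, 1200, 700, 450]


def bytes_per_minute(bits_per_second):
    """Size of one minute of audio at a constant bitrate (hypothetical helper)."""
    return bits_per_second * 60 // 8


print(f"MP3 @ 64 kbps : {bytes_per_minute(MP3_BPS):>7} bytes/min")
print(f"implied codec : {bytes_per_minute(neural_bps):>7} bytes/min")
for mode in CODEC2_MODES:
    ratio = neural_bps / mode
    print(f"Codec 2 {mode:>4}  : {bytes_per_minute(mode):>7} bytes/min "
          f"({ratio:.1f}x below the implied codec)")
```

Keep in mind this only compares raw bitrates. Codec 2 is a speech codec, so none of this says anything about how the two would sound on music.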
Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
(Score: 1, Flamebait) by EvilSS on Friday November 04, @04:59PM
(Score: 2) by JoeMerchant on Friday November 04, @06:19PM
All of these "psychoacoustic" encoders rely on making things sound the same, or similar, to a human listener.
The problem is: which human listener are you grading the encoder by? My 70 year old neighbor with the hearing aid prescription (that he rarely uses?) Me (today) with my high frequency sensitivity that falls to zero at 16.12kHz most days? Younger me who was highly sensitive to sound up to 22kHz and higher? (Yes, I was an outlier...)
Not only are there these absolute cutoffs where people just can't perceive any sound at all, there are also varying sensitivities in various frequency bands... How about tinnitus sufferers?
So, if OP is going to use AI to tune a psychoacoustic encoder, the real question becomes: whose psycho is it tuning for?
Ukraine is still not part of Russia. https://www.newsweek.com/russian-state-tv-ukraine-war-dirty-bomb-putin-1754428
(Score: 1, Insightful) by Anonymous Coward on Friday November 04, @02:53PM
This "AI" will run on FB servers, also doing speech-to-text conversion, hoovering up every conversation going through it.
DO NOT WANT
(Score: 2) by PiMuNu on Friday November 04, @02:54PM (2 children)
... and how does it differ from regular compression?
(Score: 1, Insightful) by Anonymous Coward on Friday November 04, @03:27PM
> how does it differ from regular compression?
That's simple, as soon as Zuck is involved, adjectives all ramp up a notch.
After he uses up "hyper", maybe he'll move on to "meta" (oops!)
(Score: 2) by JoeMerchant on Friday November 04, @07:00PM
Hyper is subliminally a favorite term of Zuck due to his taste in DJ'ed albums:
https://en.wikipedia.org/wiki/We_Control [wikipedia.org]
Ukraine is still not part of Russia. https://www.newsweek.com/russian-state-tv-ukraine-war-dirty-bomb-putin-1754428
(Score: 1) by shrewdsheep on Friday November 04, @03:11PM (2 children)
The architecture looks like a pretty standard auto-encoder. My understanding of modern audio compressors is that compression rates are achieved by ignoring frequencies/frequency patterns based on psychoacoustic models. I do not see any reference to such models on the GitHub page (sorry, TL;DR). Anyone?
(Score: 0) by Anonymous Coward on Friday November 04, @03:30PM
(sarc) Hey, get with the program, who needs complex psycho-acoustic models when we have AI doing all the work for us.(/sarc)
(Score: 3, Informative) by EvilSS on Friday November 04, @05:13PM
(Score: 2, Funny) by MIRV888 on Friday November 04, @03:49PM
How convenient. Infrastructure is expensive. DSL forever Baby!
(Score: 5, Interesting) by PhilSalkie on Friday November 04, @04:18PM (3 children)
My biggest concern with a codec like this is that it may be possible for the output to sound perfect, but not accurately represent the input. Have a look at issues with the JBIG2 lossy compression algorithm used in document scanners: it wound up doing things like substituting a very clean-looking number "8" in the place of a noisily scanned "6", which calls into question literally every scanned document containing numbers (architectural drawings, sheets of financial records, medical records, etc.). The numbers may not match the original, and there's no way to be sure, especially if the original is no longer available (which is often the case after paper documents are scanned en masse).
Here's a quick summary, but there's lots more info out there: http://blog.pdf-tools.com/2015/07/is-jbig2-soon-banned.html [pdf-tools.com]
In non-lossy compression, a noisily scanned "6" comes through visually as a noisy image, and it's obvious to a reader that the number "6" could be in question, they can argue that maybe it's an "8" or a "3" or whatever, and thus go check other sources or double-check columns and sums and such. When the resulting JBIG2 lossy decode is complete, the "8" looks perfect, because it's a copy of an "8" from a totally different section of the document - there's no reason for a reader to question it, other than that the bridge fell down or the room is larger than the width of the building it's supposed to be in.
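That failure mode can be sketched in a few lines. To be clear, this is a toy illustration of pattern matching and substitution in general, not the JBIG2 algorithm itself: the 3x5 glyph bitmaps, the `MAX_DIFF` threshold, and the function names are all made up for the example.

```python
# Toy symbol-substitution "compressor": each glyph is stored as a reference
# to the first sufficiently similar glyph already in the dictionary.
# Glyph shapes, threshold, and names are hypothetical.

def bitmap(rows):
    """Flatten a list of strings ('#' = ink) into a tuple of 0/1 pixels."""
    return tuple(1 if ch == "#" else 0 for row in rows for ch in row)


EIGHT = bitmap(["###", "#.#", "###", "#.#", "###"])
# A "6" whose bottom-right pixel dropped out during scanning.
NOISY_SIX = bitmap(["###", "#..", "###", "#.#", "##."])

MAX_DIFF = 2  # glyphs differing by at most this many pixels are "the same"


def hamming(a, b):
    """Number of pixels that differ between two equal-sized bitmaps."""
    return sum(x != y for x, y in zip(a, b))


def compress(glyphs):
    dictionary = []  # unique glyph bitmaps kept in the file
    refs = []        # one dictionary index per glyph on the page
    for g in glyphs:
        for i, known in enumerate(dictionary):
            if hamming(g, known) <= MAX_DIFF:
                refs.append(i)  # reuse the earlier glyph, discard this one
                break
        else:
            dictionary.append(g)
            refs.append(len(dictionary) - 1)
    return dictionary, refs


def decompress(dictionary, refs):
    return [dictionary[i] for i in refs]


page = [EIGHT, NOISY_SIX]
dictionary, refs = compress(page)
decoded = decompress(dictionary, refs)

print(len(dictionary))          # 1: only the "8" was stored
print(decoded[1] == EIGHT)      # True: the "6" comes back as a clean "8"
print(decoded[1] == NOISY_SIX)  # False: the scanned pixels are gone
```

Nothing in the decoded page hints that the second glyph was ever anything but an "8", which is the whole problem.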
If we start hearing speech through an AI-powered lossy codec, can we be sure that the words we're hearing all clean and nice are actually the words spoken into the compression stage? Back in the day, one of the boring jobs in Network Television engineering was to sit and read from long lists of words and numbers, while someone was on the other end of the (new at the time) digital codec system, writing down everything they heard. Then we got to compare the two lists, and see how much of a "game of telephone" this new-at-the-time technology would be creating - at first it was worrisome, but it kept getting better as the tech improved.
Looks like there's going to be this whole process over again, except I'm afraid nobody will bother doing all that work - they'll just build some software, say "Sounds nice to me", and the codec will start making its own pseudo-random deepfake-type changes to the audio streams we hear, and we'll have no reason to say "I wonder if they actually said that...".
(Score: 0) by Anonymous Coward on Friday November 04, @05:42PM
Mod parent to "11". As I write this it's already at 5, so I can't push it any higher.
For an extreme example, I could see voice compression done by recognizing common words (with some error rate...) and then putting them in the data stream as text/characters. On playback, text-to-voice reproduces the word that the system thought it recognized during "compression". Perhaps the playback/expansion includes some added "hinting" to make it sound like the original speaker.
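A rough sense of how small such a text stream would be, as a sketch only. The speaking rate and word length below are assumed round numbers, not measurements, and `text_stream_bps` is a made-up helper.

```python
# Rough bitrate of speech sent as plain text (all inputs are assumptions).
WORDS_PER_MINUTE = 150  # assumed conversational speaking rate
CHARS_PER_WORD = 6      # assumed average, including the trailing space
BITS_PER_CHAR = 8       # one byte per character, no further compression


def text_stream_bps(wpm, chars_per_word, bits_per_char):
    """Average bits per second needed to send the words as characters."""
    return wpm * chars_per_word * bits_per_char / 60


bps = text_stream_bps(WORDS_PER_MINUTE, CHARS_PER_WORD, BITS_PER_CHAR)
print(bps)  # 120.0
```

Under those assumptions the text alone is on the order of a hundred bits per second, before any "hinting" about the speaker's voice is added on top.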
(Score: 2) by JoeMerchant on Friday November 04, @07:07PM
>substituting a very clean-looking number "8" in the place of a noisly-scanned "6" - which calls into question literally every scanned document containing numbers
Humans have been doing this forever, no need for OCR or other tech.
I bought a piece of land with a survey where the draftsman had substituted headings of 89deg 38min 57sec for every instance of 89deg 38min 27sec on the page. Researching neighboring deeds, it became clear that 89deg 38min 27sec was the correct heading - clerks skilled in the art refer to that as a "scrivener's error." The licensed surveyor who signed off on the mistake looked at all my research and evidence, shrugged and said: "We'll re-survey for you for our standard fee." Yeah, no thanks, there happens to be another surveyor here in town with a better reputation who will do an actual survey for me for the same price.
Ukraine is still not part of Russia. https://www.newsweek.com/russian-state-tv-ukraine-war-dirty-bomb-putin-1754428
(Score: 2) by JoeMerchant on Friday November 04, @07:14PM
>When the resulting JBIG2 lossy decode is complete, the "8" looks perfect, because it's a copy of an "8" from a totally different section of the document
I worked with polygraph lung volume monitors for a number of years. The tech my company produced used virtually no filtering, we showed the raw data from our sensors directly in our output. The competing tech produced a "signal related to respiration" which was highly filtered and sort of looked like our signals, but smoother, not calibrate-able to actual inspired/expired volumes of air, and they would occasionally invert for various reasons. They would also indicate respiration in response to respiratory efforts, even when no air exchange was taking place - a generation of infant apnea monitors in the 1990s used that tech and recorded a number of "death traces" where the alarm didn't sound until cardiac arrest had set in because there were continued respiratory efforts masking the actual apnea that was causing the infant to suffocate. Thing is: without those filters, their signals looked like garbage - very difficult to interpret and you wouldn't call them "breath waveforms" at all. With the filters, just about everything that came out of the filters did look like a breath waveform. Not terribly useful, in the final analysis.
Our tech had its issues, loose sensors etc. but... when we had a problem, you could diagnose it by looking at the output data, all that unfiltered bandwidth coming through contained a lot of information which usually exposed whatever was going on with the sensors - whether it was coming from the patient's breathing or something else.
Ukraine is still not part of Russia. https://www.newsweek.com/russian-state-tv-ukraine-war-dirty-bomb-putin-1754428
(Score: 2) by takyon on Friday November 04, @06:34PM
Which joker put the "Meta" nexus on this story?
[SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]