posted by janrinok on Tuesday April 07, @08:28PM

https://techtoday.co/googles-new-compression-drastically-shrinks-ai-memory-use-while-quietly-speeding-up-performance-across-demanding-workloads-and-modern-hardware-environments/

As models scale, the memory demanded by the key-value (KV) cache becomes increasingly difficult to manage without compromising speed or accessibility in modern LLM deployments. Traditional approaches attempt to reduce this burden through quantization, a method that compresses numerical precision. However, these techniques often introduce trade-offs, particularly reduced output quality or additional memory overhead from stored constants.
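For context on that baseline, here is a minimal sketch of conventional scale-based quantization; the int8 width and per-block layout are illustrative assumptions, not details from Google's work. Note how each block must carry a stored constant (the scale), which is exactly the overhead mentioned above.

```python
import numpy as np

def quantize_int8(block: np.ndarray):
    """Uniformly quantize a block of float32 values to int8.
    The scale factor has to be stored alongside the payload --
    the "memory overhead from stored constants" noted above."""
    scale = max(float(np.abs(block).max()), 1e-12) / 127.0
    return np.round(block / scale).astype(np.int8), np.float32(scale)

def dequantize_int8(q: np.ndarray, scale: np.float32) -> np.ndarray:
    return q.astype(np.float32) * scale

vec = np.random.default_rng(0).standard_normal(128).astype(np.float32)
q, scale = quantize_int8(vec)
print("max round-trip error:", float(np.abs(vec - dequantize_int8(q, scale)).max()))
```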

This tension between efficiency and accuracy remains unresolved in many existing systems that rely on AI tools for large-scale processing.

Google’s TurboQuant introduces a two-stage process intended to address these long-standing limitations.

The first stage relies on PolarQuant, which transforms vectors from standard Cartesian coordinates into polar representations. Instead of storing multiple directional components, the system condenses the information into radius and angle values, a compact shorthand that reduces the need for repeated normalization steps and limits the overhead that typically accompanies conventional quantization methods.
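The summary describes the transform only at a high level; the following toy sketch shows the Cartesian-to-polar idea on consecutive coordinate pairs. The pairing scheme, the 3-bit angle codes, and leaving the radius unquantized are all assumptions for illustration, not Google's actual PolarQuant.

```python
import numpy as np

def to_polar_pairs(v: np.ndarray):
    """Reinterpret consecutive coordinate pairs (x, y) as (radius, angle).
    Angles are bounded in [-pi, pi], so they quantize cleanly without
    per-vector normalization constants."""
    x, y = v[0::2], v[1::2]
    return np.hypot(x, y), np.arctan2(y, x)

def quantize_angle(theta: np.ndarray, bits: int = 3) -> np.ndarray:
    """Map each angle onto one of 2**bits evenly spaced codes."""
    levels = 2 ** bits
    return np.round((theta + np.pi) / (2 * np.pi) * (levels - 1)).astype(np.uint8)

def dequantize_angle(code: np.ndarray, bits: int = 3) -> np.ndarray:
    return code / (2 ** bits - 1) * 2 * np.pi - np.pi

v = np.random.default_rng(0).standard_normal(8).astype(np.float32)
r, theta = to_polar_pairs(v)
print("max angle error:", float(np.abs(theta - dequantize_angle(quantize_angle(theta))).max()))
```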

The second stage applies Quantized Johnson-Lindenstrauss, or QJL, which functions as a corrective layer. PolarQuant handles most of the compression but can leave small residual errors; QJL reduces each vector element to a single bit, either positive or negative, while preserving the essential relationships between data points.

This additional step refines attention scores, which determine how models prioritize information during processing.
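The summary doesn't spell out the QJL construction, so the sketch below shows only the generic one-bit idea it builds on: project with a shared random Gaussian matrix and keep each output coordinate's sign. The projection size and the angle estimator are standard sign-sketch (SimHash-style) choices assumed for illustration, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 512                       # original and projected dims (assumed)
P = rng.standard_normal((m, d))       # random projection, shared by all vectors

def one_bit_sketch(v: np.ndarray) -> np.ndarray:
    """One bit per projected coordinate: just its sign."""
    return np.signbit(P @ v)

a = rng.standard_normal(d)
b = a + 0.3 * rng.standard_normal(d)  # a nearby vector

# For random hyperplanes, P(signs differ) = angle / pi, so bit agreement
# between sketches preserves the angular relationships attention relies on.
agree = np.mean(one_bit_sketch(a) == one_bit_sketch(b))
est = np.pi * (1.0 - agree)
true = np.arccos(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"estimated angle {est:.3f} rad vs true {true:.3f} rad")
```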

According to reported testing, TurboQuant achieves efficiency gains across several long-context benchmarks using open models.

The system reportedly reduces key-value cache memory usage by a factor of six while maintaining consistent downstream results. It also enables quantization to as little as three bits without requiring retraining, which suggests compatibility with existing model architectures.
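To make those numbers concrete, a back-of-the-envelope calculation; every model dimension below is an assumption for illustration, not a figure from the report.

```python
# Hypothetical KV-cache sizing for a mid-size model (all dims assumed).
layers, heads, head_dim = 32, 32, 128
context, batch = 32_768, 1
elems = 2 * layers * heads * head_dim * context * batch  # keys + values

for bits, label in [(16, "fp16 baseline"), (3, "3-bit quantized")]:
    print(f"{label:>16}: {elems * bits / 8 / 2**30:5.1f} GiB")
# Prints 16.0 GiB vs 3.0 GiB, a ~5.3x raw reduction; the reported ~6x
# depends on the baseline precision and how stored constants are counted.
```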

The reported results also include gains in processing speed, with attention computations running up to eight times faster than standard 32-bit operations on high-end hardware. These results indicate that compression does not necessarily degrade performance under controlled conditions, although such outcomes depend on benchmark design and evaluation scope.

This system could also lower operating costs by reducing memory demands, while making it easier to deploy models on constrained devices where processing resources remain limited. At the same time, freed resources may instead be redirected toward running more complex models rather than reducing infrastructure demands.

While the reported results appear consistent across multiple tests, they remain tied to specific experimental conditions. The broader impact will depend on real-world implementation, where variability in workloads and architectures may produce different outcomes.


Original Submission

  • (Score: 2) by JoeMerchant on Tuesday April 07, @08:36PM (3 children)

    by JoeMerchant (3937) on Tuesday April 07, @08:36PM (#1439204)

    If you're working an extensive AI problem, keeping your working documents as .md markdown (or plain .txt) files is effectively a form of content-info compression, as compared to .docx or .html or .pdf or whatever. The (good) LLM agent harnesses can work with all those different formats, and if having a .pdf or whatever as the output format tickles your fancy, just ask, it can do that... but don't ask the LLM to use the more "puffed up" format documents as input sources or references; you're basically wasting context window on parsing of the format-fluff. Have a separate agent translate them to .md (or .txt) first, then feed them to the problem solver agent.

    .md format "speaks to me" more clearly than plain .txt - the formatting communicates more easily to the human in the loop, so I catch more LLM hallucinations more quickly than I would looking at a scroll of 40x25 amber flat-text on a 9" monitor, or even a .txt file in a "good" editor on a big screen.
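    A minimal sketch of that "translate to .md first" step, assuming pandoc is installed and using pypandoc as one possible wrapper (the filename is a placeholder):

    ```python
    # Assumes pandoc is on the system; "reference.docx" is a placeholder.
    import pypandoc

    def to_markdown(path: str) -> str:
        """Strip the format-fluff so the problem-solver agent's context
        window is spent on content, not .docx parsing."""
        return pypandoc.convert_file(path, "markdown")

    with open("reference.md", "w", encoding="utf-8") as f:
        f.write(to_markdown("reference.docx"))
    ```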

    --
    🌻🌻🌻🌻 [google.com]
    • (Score: 3, Funny) by jb on Wednesday April 08, @07:48AM (2 children)

      by jb (338) on Wednesday April 08, @07:48AM (#1439245)

      ...so I catch more LLM hallucinations more quickly than I would looking at a scroll of 40x25 amber flat-text...

      40 column mode? Wow. Please don't tell me you've hooked an LLM up to your Commodore 64. That would be outright sacrilege!

      • (Score: 0) by Anonymous Coward on Wednesday April 08, @09:25AM

        by Anonymous Coward on Wednesday April 08, @09:25AM (#1439257)

        If it was 40-column amber it would probably be a pre-1983 Apple ][, although that had 24 lines, or maybe a BBC Micro. The C64 was typically equipped with a color monitor or connected to a TV. (The TRS-80 had a gray phosphor as well as a different screen width, and the PET had green.) PCs usually had 80-column displays. Even the BBC Micro would typically have used 80-column text if connected to a monochrome monitor.

        I have sometimes thought that programming retrocomputers might be a good test for AI intelligence. The hardware and assembly instructions are well documented but there's not a ton of actual program source code to train on. If the AI is actually doing any thinking, it should be able to write programs for 8 bit micros just as well as it can with HTML.

      • (Score: 2) by JoeMerchant on Wednesday April 08, @12:57PM

        by JoeMerchant (3937) on Wednesday April 08, @12:57PM (#1439277)

        The screens on the system that I replaced in my first job out of school were 9" amber monitors displaying text. They gave way to 386 PCs with 15" VGA displays - luxury!

        A lot of what scrolls by in AI "thinking mode" isn't too different from 1980s text mode outputs.

        --
        🌻🌻🌻🌻 [google.com]
  • (Score: 3, Informative) by AnonTechie on Tuesday April 07, @08:49PM (2 children)

    by AnonTechie (2275) on Tuesday April 07, @08:49PM (#1439205) Journal

    I already submitted this: Google Unveils TurboQuant, a New AI Memory Compression Algorithm https://soylentnews.org/article.pl?sid=26/03/28/0349246 [soylentnews.org]

    --
    Albert Einstein - "Only two things are infinite, the universe and human stupidity, and I'm not sure about the former."
    • (Score: 4, Informative) by janrinok on Tuesday April 07, @09:30PM (1 child)

      by janrinok (52) Subscriber Badge on Tuesday April 07, @09:30PM (#1439208) Journal

      In which case I must apologise for posting a dupe. It didn't ring a bell with me, but I must confess that I have been very distracted by real life (or, more accurately, a real death) over the last few weeks.

      At least your submission was accepted and not ignored.

      Again - sorry.

      --
      [nostyle RIP 06 May 2025]
      • (Score: 2) by AnonTechie on Wednesday April 08, @08:31PM

        by AnonTechie (2275) on Wednesday April 08, @08:31PM (#1439340) Journal

        We all make mistakes, so there is no need to be sorry. Such things happen ... My heartfelt condolences, and I hope your situation improves quickly !!

        --
        Albert Einstein - "Only two things are infinite, the universe and human stupidity, and I'm not sure about the former."
  • (Score: 3, Touché) by JamesWebb on Wednesday April 08, @12:51AM (1 child)

    by JamesWebb (59459) on Wednesday April 08, @12:51AM (#1439214)

    Another epicycle. Solving the problem by adding a compression tool on top of it is MONKEY.

    • (Score: 0) by Anonymous Coward on Wednesday April 08, @08:01PM

      by Anonymous Coward on Wednesday April 08, @08:01PM (#1439338)

      Next up in the innovation framework: removing compression tools for added accuracy!

  • (Score: 1) by shrewdsheep on Wednesday April 08, @08:10AM

    by shrewdsheep (5215) on Wednesday April 08, @08:10AM (#1439253)

    This is a dupe, but TFS is much clearer than the previous submission. LLMs (and other deep-learning-based models) tend to produce semantic directions that allow geometric manipulation in representation space, e.g. adding a "sex" vector to change the sex of a representation, and likewise for tiny/large, color, etc.
    It makes a lot of sense to switch to a directional representation, i.e. polar coordinates. Quantization in this space might even be beneficial by removing noise, i.e. directions with low component values are shrunk to zero. Finally, standardization comes for free, as the length is simply dropped and assumed to be 1 (well, apart from the sign mentioned above, for which one bit is needed).
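    A toy numpy illustration of the semantic-direction idea (all vectors here are random placeholders, not real learned directions):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    d = 64
    rep = rng.standard_normal(d)        # representation of some concept
    direction = rng.standard_normal(d)  # stand-in for a learned semantic direction

    edited = rep + direction            # geometric manipulation in representation space

    unit = edited / np.linalg.norm(edited)  # length dropped and assumed to be 1
    print("cosine to original:", float(unit @ (rep / np.linalg.norm(rep))))
    ```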
