Stories
Slash Boxes
Comments

SoylentNews is people

posted by Fnord666 on Saturday February 29 2020, @06:05PM   Printer-friendly
from the zip-it-up dept.

Hans Wennborg does a deep dive into the history and evolution of the Zip compression format and underlying algorithms in a blog post. While this lossless compression format became popular around three decades ago, it has its roots in the 1950s and 1970s. Notably, as a result of the "Arc Wars" of the 1980s, hitting BBS users hard, the Zip format was dedicated to the public domain from the start. The main work of the Zip format is performed through use of Lempel-Ziv compression (LZ77) and Huffman coding.

I have been curious about data compression and the Zip file format in particular for a long time. At some point I decided to address that by learning how it works and writing my own Zip program. The implementation turned into an exciting programming exercise; there is great pleasure to be had from creating a well oiled machine that takes data apart, jumbles its bits into a more efficient representation, and puts it all back together again. Hopefully it is interesting to read about too.

This article explains how the Zip file format and its compression scheme work in great detail: LZ77 compression, Huffman coding, Deflate and all. It tells some of the history, and provides a reasonably efficient example implementation written from scratch in C. The source code is available in hwzip-1.0.zip.

Previously:
Specially Crafted ZIP Files Used to Bypass Secure Email Gateways (2019)
Which Compression Format to Use for Archiving? (2019)
The Math Trick Behind MP3s, JPEGs, and Homer Simpson's Face (2019)
Ask Soylent: Internet-communication Archival System (2014)


Original Submission

 
This discussion has been archived. No new comments can be posted.
Display Options Threshold/Breakthrough Mark All as Read Mark All as Unread
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 4, Interesting) by Anonymous Coward on Saturday February 29 2020, @06:47PM (10 children)

    by Anonymous Coward on Saturday February 29 2020, @06:47PM (#964639)

    How could someone write about LZW and completely miss the fact that it was patented and aggressively defended.
    https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch [wikipedia.org]
    https://en.wikipedia.org/wiki/PKZIP [wikipedia.org]

    Starting Score:    0  points
    Moderation   +4  
       Insightful=1, Interesting=2, Informative=1, Total=4
    Extra 'Interesting' Modifier   0  

    Total Score:   4  
  • (Score: 5, Informative) by bzipitidoo on Saturday February 29 2020, @07:57PM (4 children)

    by bzipitidoo (4388) on Saturday February 29 2020, @07:57PM (#964652) Journal

    The author took a deep dive into ZIP, that's all. I mean, if you want to go on about what was left out, there's LZ78, and a whole host of tweaks and improvements, of which only LZSS was mentioned. LZMA and LZMA2 are perhaps the current state of the art in all the LZ variants.

    Then there's Huffman coding. Arithmetic Coding has been around since 1983, and is always as good or better than Huffman at shrinking the compressed size. Basically, Huffman can use only integer sizes. What if the optimum size for a code length is 6.5 bits? Huffman has to make it 6 or 7. Arithmetic Coding does not have that limitation. The problem was, Arithmetic Coding was patented. Consequently, everyone used Huffman. bzip2 is fairly well known, but did anyone ever wonder, what was bzip1, or just bzip? bzip exists, yes, but it uses Arithmetic Coding, and no one would use it for fear of lawsuits. The author made a new version, bzip2, which uses Huffman

    • (Score: 3, Interesting) by driverless on Saturday February 29 2020, @10:23PM (3 children)

      by driverless (4770) on Saturday February 29 2020, @10:23PM (#964688)

      Arithmetic coding wasn't patented. IBM had a binary variant that was, but no-one ever used it. Problem with arithmetic coding was that in its classical form it was really, really slow.

      • (Score: 3, Insightful) by bzipitidoo on Sunday March 01 2020, @12:26AM (2 children)

        by bzipitidoo (4388) on Sunday March 01 2020, @12:26AM (#964714) Journal

        What mattered was the possibility. People could not be sure that some patent troll with submarine patents would not pop up at the worst time and bring a lawsuit. Even with the certainty that the lawsuit was totally groundless, and the troll was guaranteed to lose in court, no one wanted to go through the hassle and expense of a lawsuit. It was simply safer to use the slightly inferior Huffman coding than take that chance, slim though it was.

        • (Score: 2) by driverless on Sunday March 01 2020, @06:14AM (1 child)

          by driverless (4770) on Sunday March 01 2020, @06:14AM (#964794)

          That had nothing to do with it. Pure arithemetic coding was unencumbered, in the 1980s no-one cared or even knew about patent trolls, and even for patented stuff like LZW everyone used it because Sperry weren't enforcing their patent. Even ignoring all of that, no-one operates like that because then you could literally never do anything at all.

          • (Score: 2) by bzipitidoo on Tuesday March 03 2020, @06:27PM

            by bzipitidoo (4388) on Tuesday March 03 2020, @06:27PM (#966072) Journal

            The perils of patents that came with Arithmetic Coding have everything to do with why, in the mid 1990s, bzip2 was accepted and bzip was not.

            The driving reason for the creation of PNG was that GIF was patent encumbered. Maybe Sperry wasn't trying to shakedown users, but Unisys most assuredly was. Website owners were encouraged to switch to PNG, with such initiatives as Burn All GIFs Day.

            Even in the 1980s, people were quite aware of intellectual property problems. After all, the FSF was founded in 1985.

            > Even ignoring all of that, no-one operates like that because then you could literally never do anything at all.

            It's true that it's not possible to write anything but the simplest of programs without inadvertently violating software patents by the hundreds. Not only were the courts crazy enough to allow the patenting of software, they even support patent holders' efforts to go after the users. You'd think they should only go after the creators of some allegedly infringing software, and leave the users out of the fight. But no, they can successfully sue the users. Before their patents expired, Unisys embarked on a massive campaign to shakedown every significant website that used a GIF anywhere at all. That's what users are anxious to avoid. And if that means a small hit to the best possible, in their view, it's worth it.

  • (Score: 3, Informative) by sjames on Saturday February 29 2020, @08:25PM (3 children)

    by sjames (2882) on Saturday February 29 2020, @08:25PM (#964659) Journal

    Compression was an example of a big grass roots win for free and open, starting with the war between SEA and PKWARE. The courts decided for SEA, but the people (at least the subset that even knew what file compression was) decided for PK and so SEA went away in short order. I remember BBSes at the time, and how quickly the BBS software was modified/configured to not even accept uploads compressed with SEA. Where they weren't outright rejected, users uploading something compressed with SEA would be dogpiled with messages (friendly and not so friendly) telling them to only use PKZIP (and sometimes why). All that before PKZIP got new compression methods that made it more effective than SEA's ARC.

    Unisys rattling the legal saber over GIF had similar results in spite of browser vendors dragging their feet implementing support for PNG.

    • (Score: 4, Informative) by pipedwho on Saturday February 29 2020, @09:04PM (2 children)

      by pipedwho (2032) on Saturday February 29 2020, @09:04PM (#964669)

      That's because SEA made a utility called 'arc' that generated .arc files. Then Phil Katz came along and made pkarc/pkxarc which did the same thing, but was considerably faster and produced slightly smaller files in most cases. The pkarc/pkxarc utilities were also a lot smaller, and you only needed to keep pkxarc around if you only every extracted files.

      Then SEA sued Phil Katz over a trademark issue and Phil renamed the pkarc/pkxarc utilities to pkpak/pkxpak which generated .pak files. These were basically identical to .arc files.

      However, then not long after that (and probably inspired by the law suit) Phil created the zip format and pkzip. Zip was almost as fast as pkarc/pkxarc, but produced much smaller files, and had a much improved container format with longer CRCs and optional encryption (yeah the first version of his encryption was totally insecure, but he fixed that when it was pointed out by the crypto community at the time). The .zip format saw the test of time and became the most popular archive format on BBSes and the internet. The size/speed tradeoff is still the reason why it (and other implementations of the format like gzip) are still around.

      LZMA/LZMA2 is much better with compression size, but is nowhere near as fast at either compression or extraction. And with high speed networks and large storage, the extra 20% size saving may not be worth the time penalty.

      • (Score: 2) by sjames on Saturday February 29 2020, @09:36PM

        by sjames (2882) on Saturday February 29 2020, @09:36PM (#964678) Journal

        The technical advantages are why it is still around today. The legal/social issues are why SEA went away never to be seen again (SOURCE: I was active on BBSes at the time). Until the crazy trademark crap, there was room for both in the community.

      • (Score: 0) by Anonymous Coward on Monday March 02 2020, @05:20PM

        by Anonymous Coward on Monday March 02 2020, @05:20PM (#965557)

        The size/speed tradeoff is still the reason why ... other implementations of the [.zip] format like gzip are still around.

        Hrm, gzip uses the DEFLATE algorithm for compression but it's a stretch to call it an implementation of the .zip format...

        Gzip was of course designed as a replacement for the traditional Unix "compress" utility which suffered from LZW patent threats in the early 1990s.

        I suspect the real reason for gzip's staying power in the Unix-like world has little to do with its performance (good or bad) but more to do with interoperability and network effects: basically everyone has gzip so everyone can unpack the file if you send them a .tar.gz, maybe not so much if you send them something weird like .tar.lz...

  • (Score: 4, Interesting) by zmodem on Sunday March 01 2020, @09:18AM

    by zmodem (9643) on Sunday March 01 2020, @09:18AM (#964842)

    Author here. I didn't write about LZW as I wanted to focus on Deflate, which is not based on LZW.

    While LZW is also interesting, and PKZip's legacy "shrink" method used it, I figured the article was already long as it is :-)

    LZW's patent problem does get a brief mention, as the motivation for why zlib was split out of info-zip and used for PNG.