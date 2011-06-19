from the zip-it-up dept.
Hans Wennborg does a deep dive into the history and evolution of the Zip compression format and its underlying algorithms in a blog post. While this lossless compression format became popular around three decades ago, it has its roots in the 1950s and 1970s. Notably, as a result of the "Arc Wars" of the 1980s, which hit BBS users hard, the Zip format was dedicated to the public domain from the start. The main work of the Zip format is done by Lempel-Ziv compression (LZ77) and Huffman coding.
I have been curious about data compression and the Zip file format in particular for a long time. At some point I decided to address that by learning how it works and writing my own Zip program. The implementation turned into an exciting programming exercise; there is great pleasure to be had from creating a well oiled machine that takes data apart, jumbles its bits into a more efficient representation, and puts it all back together again. Hopefully it is interesting to read about too.
This article explains how the Zip file format and its compression scheme work in great detail: LZ77 compression, Huffman coding, Deflate and all. It tells some of the history, and provides a reasonably efficient example implementation written from scratch in C. The source code is available in hwzip-1.0.zip.
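To give a feel for the LZ77 step mentioned above, here is a minimal sketch in Python. It is not taken from hwzip (which is written in C) and it is far slower than a real implementation; the function names are made up for illustration, and the default limits (32 KiB window, match lengths of 3 to 258) are borrowed from Deflate. Repeated byte runs are replaced by (distance, length) back-references into the data already seen.

```python
def lz77_compress(data: bytes, window: int = 32768,
                  min_match: int = 3, max_match: int = 258) -> list:
    """Greedy LZ77: emit literal byte values and (distance, length) pairs."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Brute-force search of the window for the longest earlier match.
        for j in range(max(0, i - window), i):
            length = 0
            # The match may overlap position i, as Deflate allows.
            while (length < max_match and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append((best_dist, best_len))
            i += best_len
        else:
            tokens.append(data[i])  # literal byte
            i += 1
    return tokens


def lz77_decompress(tokens: list) -> bytes:
    """Rebuild the original bytes by replaying literals and back-references."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                out.append(out[-dist])  # copy byte by byte so overlaps work
        else:
            out.append(tok)
    return bytes(out)


tokens = lz77_compress(b"abcabcabcabc")
print(tokens)                   # [97, 98, 99, (3, 9)]
print(lz77_decompress(tokens))  # b'abcabcabcabc'
```

In Deflate, the literals and back-references produced by a step like this are then Huffman coded, which is where most of the remaining savings come from.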
Of late, the volume of my Internet-based correspondence has been showing serious growth, and I've begun taking the preservation and archival of the products more seriously -- email conversations, message board threads, even some IRC discussions. I would like to use a unified system to archive these text-oriented communications. What solutions do Soylentils use or suggest? (I'm a Linux- and Vim-user, although discussion about systems or tools on any OS is welcome.)
Requirements:
-Browseable: A must! The value of easily revisiting past communications is immeasurable. .zip'd-type formats may save space, but even years of text-based personal correspondence don't amount to much in comparison to a few music albums or feature-length movies.
-Stored in a long-lasting format/encoding: ASCII won't work as Internet-based communication often contains structural elements like links and lists, not to mention RTF-style formatting. HTML seems like a good start.
-Maintains linear structure of discussion threads
-Searchable: Last and definitely least-necessary feature -- 'grep' is always an easy first resort :)
-Tag-able: If search features are built-in, this is an obviously valuable feature.
The Math Trick Behind MP3s, JPEGs, and Homer Simpson's Face
Over a decade ago, I was sitting in a college math physics course and my professor spelt out an idea that kind of blew my mind. I think it isn't a stretch to say that this is one of the most widely applicable mathematical discoveries, with applications ranging from optics to quantum physics, radio astronomy, MP3 and JPEG compression, X-ray crystallography, voice recognition, and PET or MRI scans. This mathematical tool—named the Fourier transform, after 18th-century French physicist and mathematician Joseph Fourier—was even used by James Watson and Francis Crick to decode the double helix structure of DNA from the X-ray patterns produced by Rosalind Franklin. (Crick was an expert in Fourier transforms, and joked about writing a paper called, "Fourier Transforms for birdwatchers," to explain the math to Watson, an avid birder.)
You probably use a descendant of Fourier's idea every day, whether you're playing an MP3, viewing an image on the web, asking Siri a question, or tuning in to a radio station. (Fourier, by the way, was no slacker. In addition to his work in theoretical physics and math, he was also the first to discover the greenhouse effect.)
So what was Fourier's discovery, and why is it useful?
The story provides great visual examples of how even complex waves can be approximated by a series of sine waves summed together. Further, the parameters of the sine waves are a much more concise description of the approximated item. Examples are given of a roughly square wave. Another example uses circles instead of sine waves. A great YouTube video shows these in action.
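The roughly square wave can be reproduced in a few lines. The sketch below is not from the linked story; it uses the standard Fourier series of a unit square wave with period 2π, which is (4/π) times the sum of sin(k·t)/k over odd k, and the function name is made up for illustration. Adding more terms moves the value at the middle of the plateau (t = π/2) closer to 1.

```python
import math


def square_wave_partial_sum(t: float, n_terms: int) -> float:
    """Sum the first n_terms odd harmonics of a unit square wave."""
    total = 0.0
    for i in range(n_terms):
        k = 2 * i + 1               # only odd harmonics contribute: 1, 3, 5, ...
        total += math.sin(k * t) / k
    return 4.0 / math.pi * total


# Sample the middle of the positive half-cycle with more and more sine waves.
for n in (1, 3, 10, 1000):
    print(n, round(square_wave_partial_sum(math.pi / 2, n), 4))
```

The "much more concise description" is then just the list of amplitudes (here 4/π, 4/3π, 4/5π, ...) rather than every sample of the wave.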
Wish I had this available to me before I was taught FT and FFT in college!
Web developer Ukiah Smith wrote a blog post about which compression format to use when archiving. Obviously the algorithm must be lossless but beyond that he sets some criteria and then evaluates how some of the more common methods line up.
After some brainstorming I have arrived at a set of criteria that I believe will help ensure my data is safe while using compression.
- The compression tool must be opensource.
- The compression format must be open.
- The tool must be popular enough to be supported by the community.
- Ideally there would be multiple implementations.
- The format must be resilient to data loss.
Some formats I am looking at are zip, 7zip, rar, xz, bzip2, tar.
He closes by mentioning error correction. That has become more important than most acknowledge due to the large size of data files, the density of storage, and the propensity for bits to flip.
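As a small illustration of why a flipped bit matters, the sketch below uses Python's zlib.crc32, the same kind of CRC-32 checksum that Zip and gzip store alongside the data. It is not from the blog post, and the helper names are hypothetical. Note that a checksum like this only detects the damage; actually repairing it needs separate redundancy such as parity data.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with a single bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def is_intact(data: bytes, expected_crc: int) -> bool:
    """Compare the data's CRC-32 against the checksum stored at archive time."""
    return zlib.crc32(data) == expected_crc


original = b"archive me for the next twenty years"
stored_crc = zlib.crc32(original)    # recorded when the archive is written
damaged = flip_bit(original, 42)     # simulate one bit flipping on disk

print(is_intact(original, stored_crc))  # True
print(is_intact(damaged, stored_crc))   # False
```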
Specially Crafted ZIP Files Used to Bypass Secure Email Gateways
Attackers are always looking for new tricks to distribute malware without it being detected by antivirus scanners and secure email gateways. This was illustrated in a new phishing campaign that utilized a specially crafted ZIP file that was designed to bypass secure email gateways to distribute the NanoCore RAT.
Every ZIP archive contains a special structure that contains the compressed data and information about the compressed files. Each ZIP archive also contains a single "End of Central Directory" (EOCD) record, which is used to indicate the end of the archive structure.
In a new spam campaign discovered by Trustwave, researchers encountered a spam email pretending to be shipping information from an Export Operation Specialist of USCO Logistics.
Attached to this email was a ZIP archive named SHIPPING_MX00034900_PL_INV_pdf.zip that looked suspicious as its file size was greater than its uncompressed content.
"The attachment "SHIPPING_MX00034900_PL_INV_pdf.zip" makes this message stand out," Trustwave stated in a report. "The ZIP file had a file size significantly greater than that of its uncompressed content. Typically, the size of the ZIP file should be less than the uncompressed content or, in some cases, ZIP files will grow larger than the original files by a reasonable number of bytes."
When examining the file, the Trustwave researchers discovered that the ZIP archive contained two distinct archive structures, each marked by their own EOCD record.
This is illustrated by the file opened in 010 Editor, which shows two different ZIPENDLOCATOR structures.
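A rough way to spot the same oddity without a hex editor is to count EOCD signatures in the raw bytes. The sketch below is not Trustwave's tooling, and the helper names are made up for illustration. It relies only on the EOCD record's 4-byte signature, PK\x05\x06, and builds its test archives in memory with Python's zipfile module. Since those four bytes can also occur by chance inside file data, a count above one is a hint to look closer, not proof of tampering.

```python
import io
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"  # marks the End of Central Directory record


def count_eocd_signatures(blob: bytes) -> int:
    """Count how many times the EOCD signature appears in the raw bytes."""
    count = 0
    pos = blob.find(EOCD_SIGNATURE)
    while pos != -1:
        count += 1
        pos = blob.find(EOCD_SIGNATURE, pos + 1)
    return count


def make_zip(name: str, payload: bytes) -> bytes:
    """Build a one-file ZIP archive in memory (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr(name, payload)
    return buf.getvalue()


normal = make_zip("order.txt", b"hello")
# Two complete archives glued together, mimicking the structure described above.
crafted = make_zip("decoy.txt", b"hello") + make_zip("second.txt", b"world")

print(count_eocd_signatures(normal))   # 1
print(count_eocd_signatures(crafted))  # 2
```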
(Score: 0) by Anonymous Coward on Saturday February 29, @06:47PM
How could someone write about LZW and completely miss the fact that it was patented and aggressively defended?
https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch [wikipedia.org]
https://en.wikipedia.org/wiki/PKZIP [wikipedia.org]
(Score: 2) by Uncle_Al on Saturday February 29, @06:51PM (1 child)
and horrible if there is any corruption in the file
At least tar spread the directory across the archive
(Score: 1) by fustakrakich on Saturday February 29, @07:07PM
Tar doesn't compress