posted by martyb on Thursday September 12 2019, @07:22PM
from the it-depends dept.

Web developer Ukiah Smith wrote a blog post about which compression format to use when archiving. Obviously the algorithm must be lossless but beyond that he sets some criteria and then evaluates how some of the more common methods line up.

After some brainstorming I have arrived with a set of criteria that I believe will help ensure my data is safe while using compression.

  • The compression tool must be opensource.
  • The compression format must be open.
  • The tool must be popular enough to be supported by the community.
  • Ideally there would be multiple implementations.
  • The format must be resilient to data loss.

Some formats I am looking at are zip, 7zip, rar, xz, bzip2, tar.

He closes by mentioning error correction. That has become more important than most acknowledge due to the large size of data files, the density of storage, and the propensity for bits to flip.
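
For a rough sense of how the formats he lists compare on a given input, Python's standard library happens to ship bindings for three of them: zlib (the deflate method behind zip and gzip), bz2, and lzma (the xz format). The sketch below is just a quick size check, not a benchmark, and the input filename is a placeholder.

    # Compress the same data with deflate, bzip2, and xz at maximum level
    # and compare the resulting sizes, using only the standard library.
    import bz2
    import lzma
    import zlib

    with open("backup.tar", "rb") as f:      # placeholder input file
        data = f.read()

    results = {
        "deflate (zip/gzip)": len(zlib.compress(data, 9)),
        "bzip2": len(bz2.compress(data, 9)),
        "xz": len(lzma.compress(data, preset=9)),
    }

    for name, size in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"{name:20s} {size:12,d} bytes ({size / len(data):.1%} of original)")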


Original Submission

 
  • (Score: 4, Informative) by bzipitidoo on Friday September 13 2019, @12:31PM (3 children)

    Oh, I know bzip2 very well. Why do you think I chose the handle I use? :)

    xz is the update that gzip needed. bzip2 was created in the 1990s, and could use an update. Now that we have so much more memory than we did then, the Burrows-Wheeler Transform could be done with much larger blocks than bzip2 uses. That alone would greatly increase the compression. The initial run-length encoding it does was a feeble attempt to make the effective block size of 900k a little bigger, sometimes. The author admitted it wasn't worth doing: it adds a little more complication for little to no gain.
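
    A minimal, naive sketch of the transform in question (sorting every rotation of the block, which real implementations replace with suffix-array construction):

    # Naive Burrows-Wheeler Transform of one block: sort all rotations and
    # emit the last column plus the index of the original rotation.  bzip2
    # does this per block of at most 900 kB; the point above is that modern
    # RAM would allow far larger blocks.
    def bwt(block: bytes) -> tuple[bytes, int]:
        n = len(block)
        rotations = sorted(range(n), key=lambda i: block[i:] + block[:i])
        last_column = bytes(block[(i - 1) % n] for i in rotations)
        return last_column, rotations.index(0)

    print(bwt(b"banana_bandana"))   # equal bytes cluster together in the output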

    The BWT itself can be improved slightly, with some changes to the sorted order it produces. For instance, if you make a simple rearrangement of the alphabet before compressing a typical text file, the compressed size will be 0.5% smaller. Put similar letters near one another; for example, put all the vowels together so that the alphabetic order is AEIOUBCDFGHJ.... You would want to rearrange the consonants as well, for instance moving C next to K or S. Another improvement to the sorted output, worth only about 0.25% shrinkage in the compressed file size but applicable to everything, not just text, is here: https://www.researchgate.net/publication/320978048_bzip2r-102tar [researchgate.net] Mind, the files it produces are not compatible with bzip2; it's just a demonstration of the improvement.
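
    A rough way to try the alphabet-reordering idea, using the bz2 module as a stand-in for the whole pipeline (the exact gain will vary by input, and the filename is a placeholder):

    # Group the vowels together in the alphabet before compressing and compare
    # bzip2 output sizes; the claim above is roughly a 0.5% saving on typical text.
    import bz2

    normal    = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    reordered = "AEIOUBCDFGHJKLMNPQRSTVWXYZ"      # vowels first, then consonants
    table = str.maketrans(normal + normal.lower(),
                          reordered + reordered.lower())

    with open("sample.txt", encoding="utf-8") as f:   # placeholder text file
        text = f.read()

    plain    = len(bz2.compress(text.encode("utf-8"), 9))
    remapped = len(bz2.compress(text.translate(table).encode("utf-8"), 9))
    print(f"original alphabet: {plain} bytes, vowels grouped: {remapped} bytes")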

    At the back end, Move-To-Front has been subjected to intense study, as it seemed so crude and simple, ripe for improvement. Some improvements have been found, but MTF proved more difficult to better than was expected. Finally, that Huffman coding can now be replaced with arithmetic coding. Did you ever wonder about bzip? bzip used arithmetic coding, and because at that time there were still patents on arithmetic coding, no one would touch it. It was only when the author created bzip2, in which the big difference from bzip was the replacement of arithmetic coding with Huffman, that the utility gained traction and became popular.
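
    Move-To-Front itself is only a few lines, which is part of why it looked so ripe for improvement; a straightforward sketch of the scheme:

    # Move-To-Front: replace each byte with its position in a recency list,
    # then move that byte to the front.  After the BWT, long runs of equal
    # bytes become runs of zeros, which the final entropy coder handles well.
    def mtf_encode(data: bytes) -> list[int]:
        table = list(range(256))
        out = []
        for byte in data:
            index = table.index(byte)
            out.append(index)
            table.pop(index)
            table.insert(0, byte)
        return out

    print(mtf_encode(b"aaaabbbbaaaa"))   # [97, 0, 0, 0, 98, 0, 0, 0, 1, 0, 0, 0]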

    There are also some simple speed improvements. bzip2 compares data one byte at a time. You can make it run 10% to 15% faster just by flipping the array in which the data is stored, so that 4 bytes at a time can be compared in one operation on a 4-byte word, on little-endian computers such as every x86 machine ever made. And that's on 32-bit machines; with 64-bit, you can compare 8 bytes at a time. On a big-endian computer, you would want to leave the byte order unchanged, of course. That's just one possible performance improvement. There is also pbzip2, which gets speed increases by employing multiple cores, if present.
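
    A small demonstration of why the flipping is needed on little-endian hardware: byte-wise lexicographic order agrees with the numeric order of the bytes read as a big-endian word, not a little-endian one.

    # Lexicographic byte order matches numeric order of a BIG-endian word, so a
    # little-endian CPU has to store the data byte-swapped before it can compare
    # 4 (or 8) bytes in a single word comparison.
    a, b = b"ab\x01\xff", b"ab\x02\x00"

    print(a < b)                                                      # True, byte by byte
    print(int.from_bytes(a, "big") < int.from_bytes(b, "big"))        # True, agrees
    print(int.from_bytes(a, "little") < int.from_bytes(b, "little"))  # False, disagrees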

  • (Score: 0) by Anonymous Coward on Friday September 13 2019, @08:52PM

    You aren't understanding my point because I misread yours.

    In contrast bzip2 can need as much as 8M. xz is in many ways an update of gzip, using the same basic method but with a number of improvements to increase compression, and it takes advantage of the much greater capacities of current computers as compared to the 1980s computers. xz can need several gigabytes of RAM.

    In contrast bzip2 can need as much as 8M. xz is in many ways an update of bzip2, using the same basic method but with a number of improvements to increase compression, and it takes advantage of the much greater capacities of current computers as compared to the 1980s computers. xz can need several gigabytes of RAM.

    You said the former; what I thought I read was the latter. That was the point I was responding to, which is why you didn't understand it and probably thought it was out of left field.

  • (Score: 0) by Anonymous Coward on Saturday September 14 2019, @08:01AM (1 child)

    I came to see whether you had responded to my last post, and re-read this one again. It really makes me wonder what a bzip3 would be like with all the improvements that could be brought to the table. Not just from the intervening years of theoretical advances and the expiry of patents on old improvements, but also from the fact that people now have no problem giving the compressor over a gigabyte of memory (rzip, lrzip, xz -9), blazingly fast multi-core processors with plenty of time (xz -e, zopfli, brotli), and larger executable sizes that allow more complex processing and cut down on the data that has to tag along with the compressed file.

    • (Score: 2) by bzipitidoo on Saturday September 14 2019, @11:34AM

      Data compression researchers have largely moved on from the BWT, to (or back to) PPM, Prediction by Partial Matching. And to using neural networks and AI to make predictions. The BWT turned out not to be a totally new way to compress data after all, but a way to do PPM fast, or a demonstration that PPM could be much faster. 1980s PPM implementations were extraordinarily slow. If a bzip3 was created, with all the improvements to Burrows-Wheeler Compression that are now known, it would be close to the best PPM, but it would fall short. It would be no faster, and its compressed output would be larger, if only slightly. So why make one? I would like to see a bzip3 anyway, but it's not in the cards.
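
      The core PPM idea fits in a short sketch: predict each symbol from its preceding context, escape to a shorter context when the symbol is new there, and charge -log2(p) bits per prediction. A real codec feeds those probabilities to an arithmetic coder; this toy, with a deliberately simplified escape estimate, only totals the estimated cost.

      # Toy PPM-style estimator: order-2 context model with escapes to shorter
      # contexts and a flat 8-bit fallback for never-seen bytes.  It reports an
      # estimated compressed size rather than producing an actual bitstream.
      import math
      from collections import defaultdict

      def ppm_estimate_bits(data: bytes, max_order: int = 2) -> float:
          counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
          total_bits = 0.0
          for i, sym in enumerate(data):
              coded = False
              for order in range(min(max_order, i), -1, -1):
                  ctx = data[i - order:i]
                  seen = counts[order][ctx]
                  total = sum(seen.values())
                  if not coded and sym in seen:
                      total_bits -= math.log2(seen[sym] / (total + 1))  # code the symbol
                      coded = True
                  elif not coded and total > 0:
                      total_bits -= math.log2(1 / (total + 1))          # pay for an escape
                  seen[sym] += 1                                        # update every order
              if not coded:
                  total_bits += 8.0       # order -1: flat model over all 256 byte values
          return total_bits

      data = b"the quick brown fox jumps over the lazy dog " * 50
      print(f"~{ppm_estimate_bits(data) / 8:.0f} bytes estimated vs {len(data)} raw")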