posted by martyb on Thursday September 12 2019, @07:22PM
from the it-depends dept.

Web developer Ukiah Smith wrote a blog post about which compression format to use when archiving. Obviously the algorithm must be lossless but beyond that he sets some criteria and then evaluates how some of the more common methods line up.

After some brainstorming I have arrived with a set of criteria that I believe will help ensure my data is safe while using compression.

  • The compression tool must be opensource.
  • The compression format must be open.
  • The tool must be popular enough to be supported by the community.
  • Ideally there would be multiple implementations.
  • The format must be resilient to data loss.

Some formats I am looking at are zip, 7zip, rar, xz, bzip2, tar.

He closes by mentioning error correction. That has become more important than most acknowledge due to the large size of data files, the density of storage, and the propensity for bits to flip.
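
For a rough sense of how the formats he lists compare on a given input, Python's standard library happens to ship bindings for three of them: zlib (the deflate method behind zip and gzip), bz2, and lzma (the xz format). The sketch below is just a quick size check, not a benchmark, and the input filename is a placeholder.

    # Compress the same data with deflate, bzip2, and xz at maximum level
    # and compare the resulting sizes, using only the standard library.
    import bz2
    import lzma
    import zlib

    with open("backup.tar", "rb") as f:      # placeholder input file
        data = f.read()

    results = {
        "deflate (zip/gzip)": len(zlib.compress(data, 9)),
        "bzip2": len(bz2.compress(data, 9)),
        "xz": len(lzma.compress(data, preset=9)),
    }

    for name, size in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"{name:20s} {size:12,d} bytes ({size / len(data):.1%} of original)")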


Original Submission

 
  • (Score: 4, Informative) by bzipitidoo on Friday September 13 2019, @12:31PM (3 children)

    Oh, I know bzip2 very well. Why do you think I chose the handle I use? :)

    xz is the update that gzip needed. bzip2 was created in the 1990s, and could use an update. Now that we have so much more memory than we did then, the Burrows-Wheeler Transform could be done with much larger blocks than bzip2 uses. That alone would greatly increase the compression. The initial run-length encoding it does was a feeble attempt to make the effective block size of 900k a little bigger, sometimes. The author admitted it wasn't worth doing: it adds a little more complication for little to no gain.
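
    A minimal, naive sketch of the transform in question (sorting every rotation of the block, which real implementations replace with suffix-array construction):

    # Naive Burrows-Wheeler Transform of one block: sort all rotations and
    # emit the last column plus the index of the original rotation.  bzip2
    # does this per block of at most 900 kB; the point above is that modern
    # RAM would allow far larger blocks.
    def bwt(block: bytes) -> tuple[bytes, int]:
        n = len(block)
        rotations = sorted(range(n), key=lambda i: block[i:] + block[:i])
        last_column = bytes(block[(i - 1) % n] for i in rotations)
        return last_column, rotations.index(0)

    print(bwt(b"banana_bandana"))   # equal bytes cluster together in the output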

    The BWT itself can be improved slightly, with some changes to the sorted order it produces. For instance, if you make a simple rearrangement of the alphabet before compressing a typical text file, the compressed size will be 0.5% smaller. Put similar letters near one another; for example, put all the vowels together so that the alphabetic order is AEIOUBCDFGHJ.... You would want to rearrange the consonants as well, for instance moving C next to K or S. Another improvement to the sorted output, worth only about 0.25% shrinkage in the compressed file size but applicable to everything, not just text, is here: https://www.researchgate.net/publication/320978048_bzip2r-102tar [researchgate.net] Mind, the files it produces are not compatible with bzip2; it's just a demonstration of the improvement.
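
    A rough way to try the alphabet-reordering idea, using the bz2 module as a stand-in for the whole pipeline (the exact gain will vary by input, and the filename is a placeholder):

    # Group the vowels together in the alphabet before compressing and compare
    # bzip2 output sizes; the claim above is roughly a 0.5% saving on typical text.
    import bz2

    normal    = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    reordered = "AEIOUBCDFGHJKLMNPQRSTVWXYZ"      # vowels first, then consonants
    table = str.maketrans(normal + normal.lower(),
                          reordered + reordered.lower())

    with open("sample.txt", encoding="utf-8") as f:   # placeholder text file
        text = f.read()

    plain    = len(bz2.compress(text.encode("utf-8"), 9))
    remapped = len(bz2.compress(text.translate(table).encode("utf-8"), 9))
    print(f"original alphabet: {plain} bytes, vowels grouped: {remapped} bytes")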

    At the back end, Move-To-Front has been subjected to intense study, as it seemed so crude and simple, ripe for improvement. Some improvements have been found, but MTF proved more difficult to better than was expected. Finally, that Huffman coding can now be replaced with arithmetic coding. Did you ever wonder about bzip? bzip used arithmetic coding, and because at that time there were still patents on arithmetic coding, no one would touch it. It was only when the author created bzip2, in which the big difference from bzip was the replacement of arithmetic coding with Huffman, that the utility gained traction and became popular.
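
    Move-To-Front itself is only a few lines, which is part of why it looked so ripe for improvement; a straightforward sketch of the scheme:

    # Move-To-Front: replace each byte with its position in a recency list,
    # then move that byte to the front.  After the BWT, long runs of equal
    # bytes become runs of zeros, which the final entropy coder handles well.
    def mtf_encode(data: bytes) -> list[int]:
        table = list(range(256))
        out = []
        for byte in data:
            index = table.index(byte)
            out.append(index)
            table.pop(index)
            table.insert(0, byte)
        return out

    print(mtf_encode(b"aaaabbbbaaaa"))   # [97, 0, 0, 0, 98, 0, 0, 0, 1, 0, 0, 0]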

    There are also some simple speed improvements. bzip2 compares data one byte at a time. You can make it run 10% to 15% faster just by flipping the array in which the data is stored, so that 4 bytes at a time can be compared in one operation on a 4-byte word, on little-endian computers such as every x86 machine ever made. And that's on 32-bit machines; with 64-bit, you can compare 8 bytes at a time. On a big-endian computer, you would want to leave the byte order unchanged, of course. That's just one possible performance improvement. There is also pbzip2, which gets speed increases by employing multiple cores, if present.
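
    A small demonstration of why the flipping is needed on little-endian hardware: byte-wise lexicographic order agrees with the numeric order of the bytes read as a big-endian word, not a little-endian one.

    # Lexicographic byte order matches numeric order of a BIG-endian word, so a
    # little-endian CPU has to store the data byte-swapped before it can compare
    # 4 (or 8) bytes in a single word comparison.
    a, b = b"ab\x01\xff", b"ab\x02\x00"

    print(a < b)                                                      # True, byte by byte
    print(int.from_bytes(a, "big") < int.from_bytes(b, "big"))        # True, agrees
    print(int.from_bytes(a, "little") < int.from_bytes(b, "little"))  # False, disagrees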

  • (Score: 0) by Anonymous Coward on Friday September 13 2019, @08:52PM

    You aren't understanding my point because I misread yours.

    In contrast bzip2 can need as much as 8M. xz is in many ways an update of gzip, using the same basic method but with a number of improvements to increase compression, and it takes advantage of the much greater capacities of current computers as compared to the 1980s computers. xz can need several gigabytes of RAM.

    In contrast bzip2 can need as much as 8M. xz is in many ways an update of bzip2, using the same basic method but with a number of improvements to increase compression, and it takes advantage of the much greater capacities of current computers as compared to the 1980s computers. xz can need several gigabytes of RAM.

    You said the former; what I thought I read was the latter. That was the point I was responding to, which is why you didn't understand it and probably thought it was out of left field.

  • (Score: 0) by Anonymous Coward on Saturday September 14 2019, @08:01AM (1 child)

    I came to see whether you had responded to my last post, and re-read this one again. It really makes me wonder what a bzip3 would be like with all the improvements that could be brought to the table. Not just from the intervening years of theoretical advances and the expiry of patents on old improvements, but also from the fact that people now have no problem giving the compressor over a gigabyte of memory (rzip, lrzip, xz -9), blazingly fast multi-core processors with plenty of time (xz -e, zopfli, brotli), and larger executable sizes that allow more complex processing and cut down on the data that has to tag along with the compressed file.

    • (Score: 2) by bzipitidoo on Saturday September 14 2019, @11:34AM

      Data compression researchers have largely moved on from the BWT, to (or back to) PPM, Prediction by Partial Matching. And to using neural networks and AI to make predictions. The BWT turned out not to be a totally new way to compress data after all, but a way to do PPM fast, or a demonstration that PPM could be much faster. 1980s PPM implementations were extraordinarily slow. If a bzip3 was created, with all the improvements to Burrows-Wheeler Compression that are now known, it would be close to the best PPM, but it would fall short. It would be no faster, and its compressed output would be larger, if only slightly. So why make one? I would like to see a bzip3 anyway, but it's not in the cards.
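
      The core PPM idea fits in a short sketch: predict each symbol from its preceding context, escape to a shorter context when the symbol is new there, and charge -log2(p) bits per prediction. A real codec feeds those probabilities to an arithmetic coder; this toy, with a deliberately simplified escape estimate, only totals the estimated cost.

      # Toy PPM-style estimator: order-2 context model with escapes to shorter
      # contexts and a flat 8-bit fallback for never-seen bytes.  It reports an
      # estimated compressed size rather than producing an actual bitstream.
      import math
      from collections import defaultdict

      def ppm_estimate_bits(data: bytes, max_order: int = 2) -> float:
          counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
          total_bits = 0.0
          for i, sym in enumerate(data):
              coded = False
              for order in range(min(max_order, i), -1, -1):
                  ctx = data[i - order:i]
                  seen = counts[order][ctx]
                  total = sum(seen.values())
                  if not coded and sym in seen:
                      total_bits -= math.log2(seen[sym] / (total + 1))  # code the symbol
                      coded = True
                  elif not coded and total > 0:
                      total_bits -= math.log2(1 / (total + 1))          # pay for an escape
                  seen[sym] += 1                                        # update every order
              if not coded:
                  total_bits += 8.0       # order -1: flat model over all 256 byte values
          return total_bits

      data = b"the quick brown fox jumps over the lazy dog " * 50
      print(f"~{ppm_estimate_bits(data) / 8:.0f} bytes estimated vs {len(data)} raw")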