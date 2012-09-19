19/09/12/0931239 story
posted by martyb on Thursday September 12, @07:22PM
from the it-depends dept.
Web developer Ukiah Smith wrote a blog post about which compression format to use when archiving. Obviously the algorithm must be lossless, but beyond that he sets some criteria and then evaluates how some of the more common methods measure up.
After some brainstorming I have arrived at a set of criteria that I believe will help ensure my data is safe while using compression.
- The compression tool must be open source.
- The compression format must be open.
- The tool must be popular enough to be supported by the community.
- Ideally there would be multiple implementations.
- The format must be resilient to data loss.
Some formats I am looking at are zip, 7zip, rar, xz, bzip2, tar.
He closes by mentioning error correction. That has become more important than most acknowledge due to the large size of data files, the density of storage, and the propensity for bits to flip.
(Score: 0) by Anonymous Coward on Thursday September 12, @07:31PM (1 child)
For archiving??? Um, none.
(Score: 2) by PartTimeZombie on Thursday September 12, @08:07PM
At the moment I am rsync-ing to a series of removable drives that get rotated offsite, no compression.
The backup has grown to about 1.2 TB or so, which I suppose is not a massive amount of data, but restores work fine when I test them, so I'm happy without compression.
(Score: 0) by Anonymous Coward on Thursday September 12, @07:42PM (1 child)
I do not care about compression since LZ4 is now everywhere deep in the system where it belongs.
(Score: 2) by NateMich on Thursday September 12, @07:49PM
Isn't LZ4's compression ratio pretty terrible though?
It is very fast of course, but when your archives require twice as much space, that could be an issue.
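The ratio question is easy to check on your own data. A minimal sketch follows, assuming Python is available. LZ4 is not in the Python standard library, so this only compares the three codecs that are (deflate as used by zip/gzip, bzip2, and the LZMA used by xz); measuring LZ4 itself would need a third-party binding. The function name `compression_ratios` and the sample text are made up for illustration.

```python
import bz2
import lzma
import zlib


def compression_ratios(data: bytes) -> dict:
    """Return original_size / compressed_size for each stdlib codec."""
    codecs = {
        "zlib (deflate, as in zip/gzip)": zlib.compress,
        "bz2 (bzip2)": bz2.compress,
        "lzma (xz)": lzma.compress,
    }
    # A ratio above 1.0 means the codec made the data smaller.
    return {name: len(data) / len(fn(data)) for name, fn in codecs.items()}


if __name__ == "__main__":
    # Highly repetitive sample; real archives will compress far less well.
    sample = b"The quick brown fox jumps over the lazy dog.\n" * 2000
    for name, ratio in compression_ratios(sample).items():
        print(f"{name}: {ratio:.1f}x")
```

Run it against a representative slice of the files you actually archive; synthetic samples like the one above exaggerate the differences.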
(Score: 3, Interesting) by SomeGuy on Thursday September 12, @07:46PM (1 child)
In practice, error correction for compressed archives is bullshit. Most error correction assumes a single bit gets flipped, but more often archives will have gaps of 512 bytes or larger due to bad sectors when reading a hard disk, mysteriously missing areas in the center due to network copy errors, truncation due to uploads crapping out, every CR converted to CRLF because someone did not know how to use FTP, or such.
Single bit errors usually happen in RAM, and from my own experience this messes things up either before or after the file is compressed/decompressed, so the archiver's error checking usually can't catch that.
Unfortunately, I have had both 7z and RAR change their default compression methods on me, and I think some ZIP programs have tried to add some of their own. The problem is, you compress a file and then suddenly anyone with an "old" archiver can't uncompress your archive any more. (Start assholes bitching about how everyone should always be using the latest and greatest here, and then slap them upside the head, because in the real world this is not always possible, especially when dealing with vintage or legacy systems.)
Some might actually recommend NOT using compression AT ALL. Just put them uncompressed in a zip/7z/RAR container, where extraction in the worst case is trivial. As a bonus, if you are using a compressed/deduplicated file system, it may get much better compression.
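For the zip case, the "container without compression" idea can be sketched with Python's standard `zipfile` module, where `ZIP_STORED` writes file bytes verbatim. The helper name `store_uncompressed` and its arguments are hypothetical, and this is only one way to do it.

```python
import zipfile
from pathlib import Path


def store_uncompressed(archive_path, source_dir):
    """Pack every file under source_dir into a zip container, uncompressed.

    Returns the archive member names that were written.
    """
    root = Path(source_dir)
    archive = Path(archive_path).resolve()
    # Collect the file list first and skip the archive itself, in case
    # it lives inside source_dir.
    files = sorted(
        p for p in root.rglob("*") if p.is_file() and p.resolve() != archive
    )
    # ZIP_STORED means "no compression": members are copied byte for byte.
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED) as zf:
        for path in files:
            zf.write(path, path.relative_to(root).as_posix())
    return [p.relative_to(root).as_posix() for p in files]
```

Because stored members are raw bytes, a damaged archive can still be picked apart by hand, which is the worst-case benefit described above.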
The only thing "resilient" to data loss is lots of backups, and constant checking and re-checking that what is supposed to be in the file(s) is what is actually there.
(Score: 0) by Anonymous Coward on Thursday September 12, @08:12PM
Error correction for archives, compressed or not, isn't bs at all. Distributed storage with erasure codes easily handles bitrot and large errors (sector rot?). This comes at a cost of complexity and increased space but does allow for self-correcting of errors.
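To illustrate the principle only: the simplest erasure code is a single XOR parity block, which can rebuild one whole lost block of any size as long as you know which block is gone. Real distributed stores typically use stronger codes that survive several losses; the sketch below is a deliberate simplification, and the function names are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def add_parity(data_blocks):
    """Return the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(blocks):
    """Rebuild the single missing block (marked None) from the survivors."""
    missing = [i for i, b in enumerate(blocks) if b is None]
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one missing block")
    survivors = [b for b in blocks if b is not None]
    repaired = list(blocks)
    # XOR of all surviving blocks (data and parity) equals the lost block.
    repaired[missing[0]] = xor_blocks(survivors)
    return repaired
```

Usage: `stripe = add_parity([b"abcd", b"efgh", b"ijkl"])`, replace any one entry with `None`, and `recover(...)` returns the original stripe. The cost is one extra block of space per stripe, which is the space-for-safety trade mentioned above.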
(Score: 0) by Anonymous Coward on Thursday September 12, @07:54PM
Be sure to have MD5 and/or SHA1 hash files, both inside and outside the archive, protecting your data. Make many backups on many different types or brands of media.
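A hash file of this kind can be produced and re-checked with a few lines of Python's standard `hashlib`. This is a rough sketch, not a replacement for existing checksum tools: the function names, the manifest layout (digest, two spaces, relative path) and the default of SHA-1 are assumptions chosen for the example, and any algorithm name `hashlib` supports can be passed instead.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(directory, manifest_path, algorithm="sha1"):
    """Write one 'digest  relative/path' line per file under directory."""
    root = Path(directory)
    manifest = Path(manifest_path).resolve()
    lines = []
    for path in sorted(root.rglob("*")):
        # Skip the manifest itself if it is stored inside the directory.
        if path.is_file() and path.resolve() != manifest:
            name = path.relative_to(root).as_posix()
            lines.append(f"{file_digest(path, algorithm)}  {name}")
    Path(manifest_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def verify_manifest(directory, manifest_path, algorithm="sha1"):
    """Return the relative paths that are missing or whose digest changed."""
    root = Path(directory)
    bad = []
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        if not line:
            continue
        expected, name = line.split("  ", 1)
        target = root / name
        if not target.is_file() or file_digest(target, algorithm) != expected:
            bad.append(name)
    return bad
```

Keeping one manifest next to the data and another copy somewhere else means a damaged manifest does not take the evidence with it. Note that a checksum only detects damage; repairing it still takes a good backup.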
(Score: 2) by JoeMerchant on Thursday September 12, @07:54PM
At present, I'm getting a .zip file from our Windoze dev ops server; everything else I shuffle around in .tar.gz format because... it's easy.
If you want to be "safe," make multiple copies.
If you want to be "safer," keep those multiple copies separated as far as practical from one another.
If you're worried about the efficiency of the compression algorithm - either you're in a very special high data volume industry, or you haven't noticed what's happened to storage prices, storage device sizes and transfer speeds in the last decade (same could have been said 10 and 20 years ago.)
As for widespread availability of the compression/decompression tools, I believe I was using the same .zip algorithms to distribute software on floppy disks back in the '90s, and .tar.gz is about as ubiquitous as it gets in the Linux world. Maybe there are others, but if anybody using a standard OS or reasonably feature rich distro's base configuration needs to install a piece of software to open your archive, I'd say you're doing it wrong.
(Score: 0) by Anonymous Coward on Thursday September 12, @08:15PM
DVD Disaster [sourceforge.net], extra space but more safety.