posted by martyb on Thursday September 12 2019, @07:22PM
from the it-depends dept.

Web developer Ukiah Smith wrote a blog post about which compression format to use when archiving. Obviously the algorithm must be lossless, but beyond that he sets some criteria and then evaluates how some of the more common formats line up.

After some brainstorming I have arrived at a set of criteria that I believe will help ensure my data is safe while using compression.

  • The compression tool must be open source.
  • The compression format must be open.
  • The tool must be popular enough to be supported by the community.
  • Ideally there would be multiple implementations.
  • The format must be resilient to data loss.

Some formats I am looking at are zip, 7zip, rar, xz, bzip2, tar.

He closes by mentioning error correction. That has become more important than most acknowledge due to the large size of data files, the density of storage, and the propensity for bits to flip.
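
Error correction proper requires storing redundant parity data alongside the archives (PAR2 is the tool commonly used for that); even without it, a plain checksum manifest at least lets you detect bit rot before it propagates into every backup copy. A minimal, detection-only sketch in Python (the directory and file names are hypothetical):

    import hashlib
    import pathlib

    def sha256_of(path: pathlib.Path, chunk_size: int = 1 << 20) -> str:
        """Return the SHA-256 hex digest of a file, read in chunks."""
        digest = hashlib.sha256()
        with path.open("rb") as fh:
            while chunk := fh.read(chunk_size):
                digest.update(chunk)
        return digest.hexdigest()

    # Write a manifest for every archive in a (hypothetical) backup directory.
    backup_dir = pathlib.Path("backups")
    with (backup_dir / "SHA256SUMS").open("w") as out:
        for archive in sorted(backup_dir.glob("*.tar.xz")):
            out.write(f"{sha256_of(archive)}  {archive.name}\n")

    # Re-hashing later and comparing against the manifest detects corruption;
    # actually repairing it needs separate parity data (e.g. generated with PAR2).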


Original Submission

 
  • (Score: 2) by maxwell demon on Friday September 13 2019, @08:54AM (6 children)

    by maxwell demon (1608) on Friday September 13 2019, @08:54AM (#893563) Journal

    "The problem is that without the header, you can't find where each file starts, nor decompress it."

    There's a file header in front of each and every single file, describing that file, and that file only. The first file header starts right at the beginning, and each file header contains the length of the compressed data, so there's no question where the next file begins.

    And in case a file header is corrupted, or the zip archive doesn't begin at the beginning of the file (common e.g. for DOS/Windows self-extracting archives -- although there the start could be determined from the EXE header data), each file header begins with a fixed, specified 4-byte local file header signature, so you've got a good chance of identifying file headers even if you can't make sense of the preceding data.

    Moreover, each file header contains the file name unencrypted, so that's a further way to identify where the file you are looking for resides.

    You're not looking for "PK", you're looking for the full 4-byte signature, 50 4B 03 04 ("PK\x03\x04" as stored on disk), which gives fewer false positives because (a) it is twice as long, and (b) its last two bytes are not ASCII text and thus are unlikely to appear in text (while "PK" is only two bytes, both ASCII letters that may well happen to be part of some file name).
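
    To make this concrete, here is a rough sketch (mine, not from the comment, assuming Python and a hypothetical damaged.zip) that scans raw bytes for the 50 4B 03 04 signature and decodes the 30-byte fixed local file header with struct; the field layout follows the ZIP application note. Entries written in streaming mode may report zero sizes here, with the real sizes in a trailing data descriptor.

        import struct

        LOCAL_SIG = b"PK\x03\x04"      # bytes 50 4B 03 04 on disk
        LOCAL_FMT = "<4sHHHHHIIIHH"    # 30-byte fixed part of the local file header
        LOCAL_LEN = struct.calcsize(LOCAL_FMT)

        def find_local_headers(data: bytes):
            """Yield (offset, name, compressed_size) for every local file header
            signature found in the raw bytes, even if earlier data is garbage."""
            pos = 0
            while (pos := data.find(LOCAL_SIG, pos)) != -1:
                fixed = data[pos:pos + LOCAL_LEN]
                if len(fixed) < LOCAL_LEN:
                    break
                (_sig, _ver, _flags, _method, _mtime, _mdate, _crc,
                 comp_size, _uncomp_size, name_len, _extra_len) = struct.unpack(LOCAL_FMT, fixed)
                name = data[pos + LOCAL_LEN:pos + LOCAL_LEN + name_len]
                # Streamed entries may report 0 here; real sizes follow in a data descriptor.
                yield pos, name.decode("utf-8", "replace"), comp_size
                pos += len(LOCAL_SIG)

        # Hypothetical usage on a damaged archive:
        with open("damaged.zip", "rb") as fh:
            for offset, name, size in find_local_headers(fh.read()):
                print(f"local header at {offset}: {name!r}, {size} bytes compressed")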

    --
    The Tao of math: The numbers you can count are not the real numbers.
  • (Score: 2) by Reziac on Friday September 13 2019, @02:16PM (5 children)

    by Reziac (2489) on Friday September 13 2019, @02:16PM (#893646) Homepage

    Yeah, I've done enough trawling through hex to be aware; my eyes are used to binary. The other problem with a regular zip is that even if you can find the body parts, absent an intact header you don't have the info to unpack them.

    --
    And there is no Alkibiades to come back and save us from ourselves.
    • (Score: 2) by maxwell demon on Friday September 13 2019, @04:02PM (4 children)

      by maxwell demon (1608) on Friday September 13 2019, @04:02PM (#893713) Journal

      Ah, playing the “moving the goalpost” game.

      Anyway, there are only a few compression methods, so just try decompression with each of them in turn until you get a working file.

      But what is the probability that the file header is completely destroyed, but the immediately following file data is kept completely intact? I'd expect it to be very low. And with the usual compression methods, if the beginning of a file is corrupted, basically the complete file is gone. That is true no matter whether you store the compressed file in a zip file, in a tar file, or as a standalone file in the filesystem.
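
      As a rough illustration of the "try each method in turn" idea above (my sketch, not part of the comment): zip members are almost always raw deflate, but for a recovered blob of unknown origin Python's standard library can simply try the usual candidates:

          import bz2
          import lzma
          import zlib

          def try_decompress(blob: bytes):
              """Return (method, data) for the first decompressor that accepts the blob."""
              candidates = [
                  ("raw deflate (as inside zip)", lambda b: zlib.decompress(b, -15)),
                  ("zlib", zlib.decompress),
                  ("gzip", lambda b: zlib.decompress(b, 31)),
                  ("bzip2", bz2.decompress),
                  ("xz/lzma", lzma.decompress),
              ]
              for name, decompress in candidates:
                  try:
                      return name, decompress(blob)
                  except Exception:
                      continue
              return None, None

          # Hypothetical usage: `blob` would be data carved out of a damaged archive.
          # method, data = try_decompress(blob)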

      --
      The Tao of math: The numbers you can count are not the real numbers.
      • (Score: 2) by Reziac on Friday September 13 2019, @04:47PM (3 children)

        by Reziac (2489) on Friday September 13 2019, @04:47PM (#893731) Homepage

        "if the beginning of a file is corrupted, basically the complete file is gone."

        See, that's exactly what I've been griping about. Perhaps not well stated, but that's the gist of it.

        --
        And there is no Alkibiades to come back and save us from ourselves.
        • (Score: 3, Insightful) by maxwell demon on Friday September 13 2019, @06:01PM (2 children)

          by maxwell demon (1608) on Friday September 13 2019, @06:01PM (#893769) Journal

          But the point is that this is not a property of the zip archive format (and has nothing to do with headers), but a general property of compressed files, and thus would also be true for your claimed “better” format (which actually is basically equivalent to zip, as far as you specified it).

          --
          The Tao of math: The numbers you can count are not the real numbers.
          • (Score: 3, Interesting) by Reziac on Friday September 13 2019, @07:17PM (1 child)

            by Reziac (2489) on Friday September 13 2019, @07:17PM (#893817) Homepage

            I think I said that earlier. If the goal is a bulletproof archive, don't compress by any means whatsoever.

            And what I was going for is making a horrible format (.DOCX) slightly less horrible, assuming the guts remain a gaggle of XML and CSS; it would still suck. Personally I'd get rid of the whole thing.

            --
            And there is no Alkibiades to come back and save us from ourselves.
            • (Score: 0) by Anonymous Coward on Friday September 13 2019, @08:34PM

              by Anonymous Coward on Friday September 13 2019, @08:34PM (#893842)

              As an outside observer, I still don't get what your point is. Part of the problem, I think, is that you have an idea of what the PKZIP format specifies but don't actually know, or that you have insufficiently specified how your version differs from zip. Zip files have individual headers at the start of each compressed file that contain all the information necessary to decompress, verify, and extract that particular file. In addition, there is a central directory trailer that contains all the information necessary to decompress, verify, and extract each and every file.

              In the event the trailer is trashed, you can still iterate through the file and decompress each entry; if a file header is trashed, you can use the directory to decompress it. Worst case, you just look for the next magic "PK\x03\x04", "PK\x05\x06", or "PK\x07\x08" in case both the trailer and the preceding file header are trashed.

              What, exactly, are you proposing that is different or better?
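
              For the undamaged case described above, Python's standard zipfile module already works from the central directory and can re-check every member's CRC-32; a small sketch (the archive name is hypothetical):

                  import zipfile

                  # Opens the archive via the central directory at its end.
                  with zipfile.ZipFile("archive.zip") as zf:
                      for info in zf.infolist():
                          print(info.filename, info.compress_size, info.file_size)
                      # testzip() re-reads every member and verifies its CRC-32;
                      # it returns the name of the first bad member, or None.
                      print("first corrupt member:", zf.testzip())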