
posted by martyb on Thursday September 12 2019, @07:22PM   Printer-friendly
from the it-depends dept.

Web developer Ukiah Smith wrote a blog post about which compression format to use when archiving. Obviously the algorithm must be lossless but beyond that he sets some criteria and then evaluates how some of the more common methods line up.

After some brainstorming I have arrived with a set of criteria that I believe will help ensure my data is safe while using compression.

  • The compression tool must be opensource.
  • The compression format must be open.
  • The tool must be popular enough to be supported by the community.
  • Ideally there would be multiple implementations.
  • The format must be resilient to data loss.

Some formats I am looking at are zip, 7zip, rar, xz, bzip2, tar.

He closes by mentioning error correction. That has become more important than most acknowledge due to the large size of data files, the density of storage, and the propensity for bits to flip.
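
To get a rough sense of how the common algorithm families stack up, here is a minimal sketch (not from the blog post; the input path is a placeholder) that compresses the same file with the DEFLATE (zip/gzip), bzip2, and xz/LZMA codecs from Python's standard library and reports the ratios:

    import bz2
    import gzip
    import lzma

    PATH = "sample.tar"  # placeholder: any file you want to benchmark

    with open(PATH, "rb") as fh:
        data = fh.read()

    codecs = {
        "gzip (DEFLATE, as used by zip)": gzip.compress,
        "bzip2": bz2.compress,
        "xz (LZMA)": lzma.compress,
    }

    print(f"original: {len(data)} bytes")
    for name, compress in codecs.items():
        size = len(compress(data))
        print(f"{name}: {size} bytes ({size / len(data):.1%} of original)")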


Original Submission

 
  • (Score: 4, Insightful) by Anonymous Coward on Thursday September 12 2019, @07:31PM (31 children)

    by Anonymous Coward on Thursday September 12 2019, @07:31PM (#893284)

    For archiving??? Um, none.

  • (Score: 2) by PartTimeZombie on Thursday September 12 2019, @08:07PM

    by PartTimeZombie (4827) on Thursday September 12 2019, @08:07PM (#893304)

    At the moment I am rsync-ing to a series of removable drives that get rotated offsite, no compression.

    The backup has grown to about 1.2 TB or so, which I suppose is not a massive amount of data, but restores work fine when I test them, so I'm happy without compression.
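
    For anyone wanting to script the same kind of rotation, a minimal sketch of one uncompressed rsync pass (the paths and drive name are hypothetical):

        import subprocess

        SOURCE = "/home/me/data/"                # hypothetical data directory
        DEST = "/mnt/offsite-drive-a/backup/"    # hypothetical removable drive mount

        # -a preserves permissions, times and links; --delete mirrors removals.
        # No compression option is passed, so files are copied as-is.
        subprocess.run(["rsync", "-a", "--delete", SOURCE, DEST], check=True)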

  • (Score: 4, Funny) by DannyB on Thursday September 12 2019, @09:17PM (2 children)

    by DannyB (5839) Subscriber Badge on Thursday September 12 2019, @09:17PM (#893352) Journal

    Use your choice of Al Gore rhythm. Compress. Then compress the compressed file. Repeat until it is compressed down to a single binary bit.

    When you need to restore, simply take that 1 or 0 and decompress it multiple times back to your original content.

    (With apologies to Information Theory.)

    I had a clown coworker about 18 years ago. He found a scam (?) article making unbelievable claims about some new compression. Of course he believed it. I tried to explain how compression actually works, but science, information theory, etc. are no match for someone believing in the compression equivalent of perpetual motion. I heard him utter the same kinds of things perpetual motion advocates say: "But maybe they've found some NEW way of compressing, better than you think? Why don't you wait and see when people start buying their product?" They didn't offer details about the algorithm, just promised details later, etc. Reminds me of the recent claim to have found some fantastic way of factoring semiprimes.

    --
    People today are educated enough to repeat what they are taught but not to question what they are taught.
    • (Score: 0) by Anonymous Coward on Friday September 13 2019, @04:14AM (1 child)

      by Anonymous Coward on Friday September 13 2019, @04:14AM (#893523)

      When I was a young pup of sixteen I remember seeing ads in the back of my dad's Popular Mechanics magazines for the miraculous Pogue carburetor. Of course, as a poor kid making minimum wage, struggling to keep the tank filled, and a bit mechanically inclined, I was sorely tempted to order a set of plans and try my luck implementing it. Fast forward a couple of years and some freshman physics, and I realized how foolish I was to entertain the idea.

      https://www.snopes.com/fact-check/nobodys-fuel/ [snopes.com]

  • (Score: 2) by The Mighty Buzzard on Thursday September 12 2019, @10:34PM

    by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Thursday September 12 2019, @10:34PM (#893392) Homepage Journal

    Depends on what you're archiving.

    --
    My rights don't end where your fear begins.
  • (Score: 2) by barbara hudson on Thursday September 12 2019, @10:52PM (10 children)

    by barbara hudson (6443) <barbara.Jane.hudson@icloud.com> on Thursday September 12 2019, @10:52PM (#893402) Journal

    Most of the stuff people archive will probably never be looked at again. Old pics? Old source code that's obsolete? It's like when everyone went nuts buying a crap ton of blank video cassettes to copy programs that they never watched and might as well toss because they can't find a VCR to play them on.

    If people actually took the time to sort through their files, everything would probably fit on one USB key with plenty of space to spare. No compression needed. It would also make it easier to (a) find important files, since they won't be sitting on the same device as a decade of crap you backed up from previous machines because you had the space, and (b) switch machines / operating systems, since all your important data is so small.

    --
    SoylentNews is social media. Says so right in the slogan. Soylentnews is people, not tech.
    • (Score: -1, Spam) by Anonymous Coward on Friday September 13 2019, @12:05AM

      by Anonymous Coward on Friday September 13 2019, @12:05AM (#893432)

      Oh I don't know about that. I looked at this from all over this site today and must laugh at your numerous blunders against apk https://soylentnews.org/comments.pl?noupdate=1&sid=33430&page=1&cid=889582#commentwrap [soylentnews.org]

    • (Score: 3, Interesting) by Immerman on Friday September 13 2019, @01:12AM (7 children)

      by Immerman (3985) on Friday September 13 2019, @01:12AM (#893456)

      The problem I've found with relying on that strategy exclusively is that it's all but impossible to accurately anticipate *everything* that you'll end up wanting to go back to look at decades later. You'll get most of it, but will inevitably have "if only I had kept..." moments.

      The solution though is relatively simple - keep a slow, cheap bulk storage drive around, and just don't bother throwing anything away. Storage is cheap and growing constantly. Unless you're collecting or generating new data at a truly impressive rate, your next bulk storage drive will almost certainly be considerably larger than your current one - so just copy everything from the old drive into one corner of the new one, and retire the old one to offline archive duty. Just let that data accumulate.

      I would think the strategies could complement each other beautifully though. It's far easier to "throw away" things from your curated archive when you know you can always go retrieve them from the comprehensive archive at some much later date. Could really facilitate aggressive pruning.

      I don't have a decade of hindsight on this bit yet - but I think that using sentence-length descriptive file names is also going to make a huge difference in being able to find those things from decades ago, as well as making it easier to recognize things that can be pruned. It has certainly made finding more recent files hundreds of times easier, when combined with a good instantly-find-as-you-type filesystem indexing tool that works with word fragments (I use everything.exe on Windows). Some sort of database filesystem might be nice, but honestly I'd worry about compatibility - almost everything can handle long file names, and they're kind of handy even without the indexing.
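
      A minimal sketch of the fragment-matching idea (not Everything itself, just an illustration over a hypothetical archive root):

          import os

          ROOT = "/bulk/archive"  # hypothetical bulk-storage root

          def find(*fragments):
              """Yield paths whose file name contains every fragment, case-insensitively."""
              wanted = [f.lower() for f in fragments]
              for dirpath, _dirs, files in os.walk(ROOT):
                  for name in files:
                      lowered = name.lower()
                      if all(frag in lowered for frag in wanted):
                          yield os.path.join(dirpath, name)

          # Sentence-length names pay off here, e.g. find("roof", "invoice", "2012")
          for path in find("roof", "invoice", "2012"):
              print(path)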

      • (Score: 3, Interesting) by barbara hudson on Friday September 13 2019, @01:54AM (6 children)

        by barbara hudson (6443) <barbara.Jane.hudson@icloud.com> on Friday September 13 2019, @01:54AM (#893472) Journal

        At some point you realize that stuff that's gone isn't the end of the world. Pictures? I've got my memories, and that's better. And if I ever get senile, pictures won't help, and my advanced medical directive kicks in and I die, so I'm not gonna worry about it.

        If there were a fire, my priorities would be my identity papers and cards, some clothes, and my dogs. Everything else, including my data, is secondary. If I lost my movie and mp3 collection tomorrow, I would not sweat it. Same with my laptop and all my data backups. I guess my perspective has changed as I got old(er). Except not really; my list of things that are important hasn't changed in 25 years - dogs, ID paperwork, some clothes on my back. Just not the same dogs (they get old, they get sick, but I stay with them to the end and I remember each one, don't need pictures).

        --
        SoylentNews is social media. Says so right in the slogan. Soylentnews is people, not tech.
        • (Score: 2) by krishnoid on Friday September 13 2019, @03:25AM

          by krishnoid (1156) on Friday September 13 2019, @03:25AM (#893509)

          Interesting you say those things, because other than the dogs, those are pretty replaceable. Hopefully your dogs would be the ones saving you if there was a fire.

        • (Score: 0) by Anonymous Coward on Friday September 13 2019, @01:33PM

          by Anonymous Coward on Friday September 13 2019, @01:33PM (#893622)

          You're conflating "end of the world" and "good/want to have".

          Surely, it's not the end of the world if gone, yet many people want pics, emails, and do read some of them, even if you do not.

          And.. parent poster seems to be concerned, which may indicate historical annoyance at having something gone MIA. So, it's not about 'end of the world', but 'I can do this, I want to do this, so why on Earth not do it correctly, and well'.

          As far as I'm concerned, most people take 1 hour to do a 2-hour task, and then that hour ends up fruitless because the job wasn't done correctly. Do it right, or don't bother...

        • (Score: 0) by Anonymous Coward on Friday September 13 2019, @02:10PM

          by Anonymous Coward on Friday September 13 2019, @02:10PM (#893641)

          Human memory is unreliable even if old age is somewhere far away in the future, though.

        • (Score: 2) by Immerman on Monday September 16 2019, @01:36AM (2 children)

          by Immerman (3985) on Monday September 16 2019, @01:36AM (#894489)

          Sure - precious little is the end of the world. You can lose everything you own, your livelihood, every person who loves you, and even your health and limbs, and the sun still comes up tomorrow. The human animal is an amazingly resilient beast - any trauma that doesn't break you will be just a memory before long.

          But you just never know what you're going to one day wish you had. Source code for a project you never expected to care about again, that would save you weeks of effort on something you're working on today. A memento of an occasion that seemed relatively minor at the time, but became far more significant in well-seasoned retrospect.

          I've purged several times, both digitally and physically, and have eventually regretted it every time. I still do it physically, because I just don't have room for all that junk - and as you say, memories are far more important than any memento anyway. But digital stuff hardly takes any space. Sure, if I lose my movie collection, no big deal - easily replaced if it matters enough to bother. My mp3 collection would be a bigger deal - I've spent decades crafting it to my tastes to provide a variety of different ambiances conducive to different endeavors. My documents... most I would probably never miss, but I know first hand that I'll never guess the ones I truly will. And so long as I have a movie collection, the documents, mp3s, photos, etc. take up such a tiny amount of space in comparison that keeping them "just in case" is essentially free. It's not even a shoebox of old photos to lug around - just one tiny corner of a hard drive that's exactly the same size and weight either way.

          • (Score: 2) by barbara hudson on Monday September 16 2019, @05:30PM (1 child)

            by barbara hudson (6443) <barbara.Jane.hudson@icloud.com> on Monday September 16 2019, @05:30PM (#894705) Journal

            To each their own, but every time I've recreated source code that I had written and not kept, the rewrite has been better. Two reasons.

            1. More experience.

            2. The nature of the problem at hand has changed, or the computing environment has (from DOS to Windows to *nix, from 16 to 32 to 64 bit, from monochrome to CGA to VGA to true colour).

            Old code might never die, but it can become obsolete.

            --
            SoylentNews is social media. Says so right in the slogan. Soylentnews is people, not tech.
            • (Score: 2) by Immerman on Tuesday September 17 2019, @03:12PM

              by Immerman (3985) on Tuesday September 17 2019, @03:12PM (#895172)

              Oh absolutely.

              However, I've encountered numerous situations where recreating the source code isn't actually worth the time and effort required, so I make something "good enough" to do what I need, and just do without all the "nice to have" features I would have gotten for free if the old code was still around.

    • (Score: 2) by The Mighty Buzzard on Friday September 13 2019, @02:36AM

      by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Friday September 13 2019, @02:36AM (#893490) Homepage Journal

      I sort through my files a couple times a year and my multi-TB misc storage drive stays 75% full anyway; all data, no OS or programs. That's not even counting the emergency offline porn stash because it's on an entirely different drive.

      --
      My rights don't end where your fear begins.
  • (Score: 3, Insightful) by Reziac on Friday September 13 2019, @02:26AM (14 children)

    by Reziac (2489) on Friday September 13 2019, @02:26AM (#893484) Homepage

    ZIP up the junk if you wish. ZIP is pretty much universally compatible. RAR is a little more recoverable if the file gets truncated, and nearly as widely readable.

    But the important files? Oh hell no. Never ever not EVER compress them as your only backup/archive, or at all if you can avoid it. Here's why:

    I work with writers. One of the things I do is extract data from hosed files. Guess what format, when the header is corrupt, is usually not recoverable? Anything compressed to ZIP, which is to say .DOCX or .ODT. Both inventions of the devil. Had one client lose an entire finished novel; it's probably still in that .DOCX, but with no header it's not retrievable (at least not by any tool I tried, including a hex editor). All the backups were corrupted the same way, so no joy there. This is when I started beating them with a stick until they agreed to save their work in an uncompressed format.

    --
    And there is no Alkibiades to come back and save us from ourselves.
    • (Score: 2) by The Mighty Buzzard on Friday September 13 2019, @02:32AM (13 children)

      by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Friday September 13 2019, @02:32AM (#893487) Homepage Journal

      Writers need to learn basic HTML. There's fuck all they can create in any word processor that can't be done in HTML and you can read and write it in any text editor on the planet.

      --
      My rights don't end where your fear begins.
      • (Score: 4, Informative) by Reziac on Friday September 13 2019, @02:54AM (9 children)

        by Reziac (2489) on Friday September 13 2019, @02:54AM (#893497) Homepage

        Actually there are a whole bunch of structures HTML is awkward at; RTF is better at complex formatting, and equally hand-editable (I sometimes do this for money). And even then, there are a few structures WordPerfect does that nothing else can reasonably do. But point is all of these (and old Word .DOC too) can be hand-edited in either a plaintext editor or at worst, a hex editor; mangling the formatting or header does not lose the contents (and generally the formatting can be recovered too, if you know what you're looking at).

        .DOCX and .ODT are an .XML file that contains the actual text, a bunch of stylesheets, and whatever else one inserted into the document, all ZIP'd up together. So while that .XML file can be edited just like HTML ... first you gotta get it out of the ZIP, and that's where problems can arise. If I were designing such a horrible file format, I'd do something like zip each file individually, then TAR the mess together. That way it would still be compressed and single-file on disk, but in the event of corruption, most of it would be recoverable, using a hex editor and the proper swear words.
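
        To make that concrete, a minimal sketch (hypothetical file name) that pulls the actual text out of a .DOCX; the body lives in word/document.xml (content.xml for .ODT), and none of this works once the surrounding ZIP structures are mangled:

            import zipfile

            DOC = "novel.docx"  # hypothetical document

            # A .docx is just a ZIP container full of XML; this read fails
            # outright if the ZIP metadata is corrupt.
            with zipfile.ZipFile(DOC) as zf:
                xml = zf.read("word/document.xml")

            print(xml[:200])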

        Once hand-recovered 14,000 mangled JPGs from a failed RAID array... me and Frhed de-miscegenated their various parts, built new headers, and pasted their asses back together (only lost about 60 due to too many missing body parts). If they'd been compressed and no valid ZIP header, they'da been toast.

        --
        And there is no Alkibiades to come back and save us from ourselves.
        • (Score: 2) by maxwell demon on Friday September 13 2019, @06:11AM (8 children)

          by maxwell demon (1608) on Friday September 13 2019, @06:11AM (#893550) Journal

          If I were designing such a horrible file format, I'd do something like zip each file individually, then TAR the mess together.

          In a zip archive, each file is compressed individually. The ZIP header just says where each file starts. But each file individually is prefixed with a header that contains everything that you need to decompress it, and to figure out where the next file begins.

          So in other words, a zip file doesn't actually differ much from a tar file with individually compressed files, except that there's a directory at the end which spares you from seeking sequentially through the zip file to find the file to decompress.
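
          You can see that layout for yourself; a minimal sketch (archive name is a placeholder) that dumps each entry's local-header offset and sizes as recorded in the central directory:

              import zipfile

              ARCHIVE = "backup.zip"  # placeholder archive

              with zipfile.ZipFile(ARCHIVE) as zf:
                  # infolist() comes from the central directory at the end of the
                  # file; header_offset points at each entry's local file header.
                  for info in zf.infolist():
                      print(f"{info.filename}: local header at byte {info.header_offset}, "
                            f"{info.compress_size} compressed / {info.file_size} uncompressed")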

          --
          The Tao of math: The numbers you can count are not the real numbers.
          • (Score: 2) by Reziac on Friday September 13 2019, @07:26AM (7 children)

            by Reziac (2489) on Friday September 13 2019, @07:26AM (#893556) Homepage

            Yes, I know. The problem is that without the header, you can't find where each file starts, nor decompress it. This hypothetical format would let me see that: I could at least trawl through it looking for the PK marker that tells where each individual ZIP starts, then copy it out and extract it.

            As seen in a hex editor,

            PKbunchofbinarybinarybinarybinarygoodluckyoullneverfindthem (what .DOCX really looks like)

            vs

            PKbinaryPKbinaryPKbinaryPKbinaryPKbinaryPKbinaryPKbinary (what my version would look like)

            Idea being to maximize what can be salvaged from a corrupted file, rather than defaulting to losing the whole thing.

            --
            And there is no Alkibiades to come back and save us from ourselves.
            • (Score: 2) by maxwell demon on Friday September 13 2019, @08:54AM (6 children)

              by maxwell demon (1608) on Friday September 13 2019, @08:54AM (#893563) Journal

              The problem is that without the header, you can't find where each file starts, nor decompress it.

              There's a file header in front of each and every single file, describing that file, and that file only. The first file header starts right at the beginning, and each file header contains the length of the compressed data, so there's no question where the next file begins.

              And in the case one file header is corrupted, or the zip archive doesn't begin at the beginning of the file (common e.g. on DOS/Windows self-extracting archives -- although there the beginning could be determined from the EXE header data), each file header starts with a 4-byte local file header signature (a fixed, specified 4 byte sequence), so you've got a good chance to identify file headers even if you can't make sense of the preceding data.

              Moreover, each file header contains the file name unencrypted, so that's a further way to identify where the file you are looking for resides.

              You're not looking for just "PK", you're looking for the four-byte local file header signature 50 4B 03 04 ("PK\x03\x04"). Which has fewer false positives because (a) it is twice as long, and (b) the trailing two bytes are *not* ASCII text, and thus unlikely to show up in ordinary text (while "PK" on its own is only two bytes of ASCII letters that may well happen to be part of some file name).
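
              That's exactly the kind of salvage pass you can script; a minimal sketch (hypothetical file name) that scans for local file header signatures and tries to inflate whatever follows, assuming deflate or stored entries:

                  import struct
                  import zlib

                  DAMAGED = "mangled.docx"  # hypothetical damaged archive

                  with open(DAMAGED, "rb") as fh:
                      blob = fh.read()

                  pos = 0
                  while (pos := blob.find(b"PK\x03\x04", pos)) >= 0:
                      # Local file header: signature, version, flags, method, time, date,
                      # CRC-32, compressed size, uncompressed size, name len, extra len.
                      (_, _, flags, method, _, _, _crc, csize, _usize,
                       nlen, xlen) = struct.unpack_from("<4s5HI2I2H", blob, pos)
                      name = blob[pos + 30:pos + 30 + nlen].decode("utf-8", "replace")
                      start = pos + 30 + nlen + xlen
                      pos += 4
                      if flags & 0x08:
                          # Sizes live in a trailing data descriptor; skip in this sketch.
                          print(f"skipping {name!r}: uses a data descriptor")
                          continue
                      data = blob[start:start + csize]
                      try:
                          if method == 8:        # deflate
                              out = zlib.decompressobj(-15).decompress(data)
                          elif method == 0:      # stored
                              out = data
                          else:
                              raise ValueError(f"unhandled compression method {method}")
                          print(f"recovered {name!r}: {len(out)} bytes")
                      except Exception as exc:
                          print(f"could not recover {name!r}: {exc}")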

              --
              The Tao of math: The numbers you can count are not the real numbers.
              • (Score: 2) by Reziac on Friday September 13 2019, @02:16PM (5 children)

                by Reziac (2489) on Friday September 13 2019, @02:16PM (#893646) Homepage

                Yeah, done enough trawling hex to be aware. My eyes are used to binary. The other problem with a regular zip is even if you can find the body parts, absent an intact header you don't have the info to unpack them.

                --
                And there is no Alkibiades to come back and save us from ourselves.
                • (Score: 2) by maxwell demon on Friday September 13 2019, @04:02PM (4 children)

                  by maxwell demon (1608) on Friday September 13 2019, @04:02PM (#893713) Journal

                  Ah, playing the “moving the goalpost” game.

                  Anyway, there are only a few compression methods, so just try decompression with each of them in turn until you get a working file.

                  But what is the probability that the file header is completely destroyed, but the immediately following file data is kept completely intact? I'd expect very low. And with usual compression methods, if the beginning of a file is corrupted, basically the complete file is gone. And that is true no matter whether you store the compressed file in a zip file, in a tar file, or as standalone file in the filesystem.

                  --
                  The Tao of math: The numbers you can count are not the real numbers.
                  • (Score: 2) by Reziac on Friday September 13 2019, @04:47PM (3 children)

                    by Reziac (2489) on Friday September 13 2019, @04:47PM (#893731) Homepage

                    "if the beginning of a file is corrupted, basically the complete file is gone."

                    See, that's exactly what I've been griping about. Perhaps not well stated but that's the gist of it.

                    --
                    And there is no Alkibiades to come back and save us from ourselves.
                    • (Score: 3, Insightful) by maxwell demon on Friday September 13 2019, @06:01PM (2 children)

                      by maxwell demon (1608) on Friday September 13 2019, @06:01PM (#893769) Journal

                      But the point is that this is not a property of the zip archive format (and has nothing to do with headers), but a general property of compressed files, and thus would also be true for your claimed “better” format (which actually is basically equivalent to zip, as far as you specified it).

                      --
                      The Tao of math: The numbers you can count are not the real numbers.
                      • (Score: 3, Interesting) by Reziac on Friday September 13 2019, @07:17PM (1 child)

                        by Reziac (2489) on Friday September 13 2019, @07:17PM (#893817) Homepage

                        I think I said that earlier. If the objective is a bulletproof archive, don't compress by any means whatsoever.

                        And what I was going for is making a horrible format (.DOCX) slightly less horrible, assuming the guts remain a gaggle of XML and CSS; it would still suck. Personally I'd get rid of the whole thing.

                        --
                        And there is no Alkibiades to come back and save us from ourselves.
                        • (Score: 0) by Anonymous Coward on Friday September 13 2019, @08:34PM

                          by Anonymous Coward on Friday September 13 2019, @08:34PM (#893842)

                          As an outside observer, I still don't get what your point is. Part of the problem, I think, is that you have an idea of what PKZIP specifies, but don't actually know, or that you have insufficiently specified how your version is different from zip. Zip files have individual headers located at the start of each compressed file that contain all the information necessary to decompress, verify, and extract that particular file. In addition, there is also the central directory trailer that contains all the information necessary to decompress, verify, and extract each and every file. In the event the trailer is trashed, you can still iterate through the file and decompress it; if a file header is trashed, you can use the directory to decompress it. Worst case scenario, you just look for the next magic "PK\x03\x04", "PK\x05\x06", or "PK\x07\x08" in case both the trailer and the previous file header are trashed.

                          What, exactly, are you proposing that is different or better?

      • (Score: 2) by maxwell demon on Friday September 13 2019, @05:58AM (2 children)

        by maxwell demon (1608) on Friday September 13 2019, @05:58AM (#893546) Journal

        Disagree. HTML is totally overkill for simple writing. Markdown is absolutely sufficient for most writing tasks.

        On the other hand, if you do need advanced features, HTML is too limited. Use LaTeX in that case.

        --
        The Tao of math: The numbers you can count are not the real numbers.
        • (Score: 0) by Anonymous Coward on Friday September 13 2019, @02:14PM

          by Anonymous Coward on Friday September 13 2019, @02:14PM (#893644)

          LaTeX is too complex and more or less code. Plain text is where it's at. Borrow the simplest formatting (bold/italic/headings) from Markdown and call it a day.

        • (Score: 2) by The Mighty Buzzard on Friday September 13 2019, @02:16PM

          by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Friday September 13 2019, @02:16PM (#893647) Homepage Journal

          My personal preference would be for all things TeX to die in a fire. It's almost, but not quite, as enjoyable as the PDF format. I'll stick with HTML and do anything especially funky in an image file.

          --
          My rights don't end where your fear begins.