posted by janrinok on Tuesday February 06, @03:51AM   Printer-friendly
from the confidentiality-integrity-and-availability dept.

Exotic Silicon has a detailed exploration of how and why to make long term backups.

The myth...

When thinking about data backup, many people have tended to fixate on the possibility of a crashed hard disk, and in modern times, a totally dead SSD. It's been the classic disaster scenario for decades, assuming that your office doesn't burn down overnight. You sit down in front of your desktop in the morning, and it won't boot. As you reach in to fiddle with SATA cables and clean connections, you realise that the disk isn't even spinning up.

Maybe you knew enough to try a couple of short, sharp, ninety-degree twists in the plane of the platters, in case it was caused by stiction. But sooner or later, reality dawns, and it becomes clear that the disk will never spin again. It, along with your data, is gone forever. So a couple of full backups at regular intervals should suffice, right?

Except that isn't how it usually happens - most likely you'll be calling on your backups for some other reason.

The reality...

Aside from the fact that when modern SSDs fail they often remain readable, i.e. they become read-only, your data is much more likely to be at risk from silent corruption over time, or from being overwritten due to operator error.

Silent corruption can happen for reasons ranging from bad SATA cables and buggy SSD firmware, to malware and more. Operator error might go genuinely unnoticed, or be covered up.

Both of these scenarios can be protected against with an adequate backup strategy, but the simple approach of a regular full backup (which also often goes untested) in many cases just won't suffice.

Aspects like the time interval between backups, how many copies to have and how long to keep them, speed of recovery, and the confidentiality and integrity of said backups are all addressed. Also covered are silent corruption, archiving unchanging data, examples of comprehensive backup plans, and how to correctly store, label, and handle the backup storage media.

Not all storage media have long life spans.


Original Submission

This discussion was created by janrinok (52) for logged-in users only, but now has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 5, Interesting) by sigterm on Tuesday February 06, @05:11AM (14 children)

    by sigterm (849) on Tuesday February 06, @05:11AM (#1343276)

    When thinking about data backup, many people have tended to fixate on the possibility of a crashed hard disk, and in modern times, a totally dead SSD.

    This is very true, and while inevitable hardware failure should certainly be taken into account when designing a backup strategy, it's not necessarily the most pressing issue.

    All drives will eventually fail, but you can address that with redundancy as well as backups. In fact, redundancy (RAID) is the far better solution, as not only does it create an always-current backup of your system, it also handles hardware failure without downtime. It can also detect and repair silent corruption due to so-called "bit rot" by regularly scrubbing the RAID set.
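    With Linux md software RAID, for instance, triggering that scrub is trivial. A sketch (the array name /dev/md0 is illustrative, and it needs root):

    ```shell
    # Trigger a scrub (consistency check) of a Linux md RAID set.
    echo check > /sys/block/md0/md/sync_action   # start the scrub
    cat /proc/mdstat                             # watch progress
    cat /sys/block/md0/md/mismatch_cnt           # non-zero => inconsistencies found
    ```

    Typically you'd run this from a monthly cron job and alert on a non-zero mismatch count.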

    (The article tries to make the argument that RAID can make the situation worse by introducing "new failure modes," but this is just the author failing to understand statistics. Yes, the chances of a fault occurring are doubled when you have two drives instead of one, but the point is that with RAID, it takes two or more faults occurring at the same time to cause data loss. Decades of experience have shown RAID to provide excellent protection from hardware failure.)

    As the author points out, what RAID doesn't protect you from is operator/user error or faulty software/firmware. I'd also add malware/ransomware to that list. Offline backups are simply a must, and they have to go back multiple generations, as many forms of data corruption aren't necessarily detected immediately.

    It's a shame the web site looks like a Geocities page from the 1990s, but the article is well worth reading regardless.

    • (Score: 3, Funny) by Anonymous Coward on Tuesday February 06, @06:33AM (3 children)

      by Anonymous Coward on Tuesday February 06, @06:33AM (#1343284)

      > It's a shame the web site looks like a Geocities page from the 1990s, but the article is well worth reading regardless.

      That's on you. There are ten themes to choose from there, available via a click and without javascript.

      • (Score: 4, Touché) by sigterm on Tuesday February 06, @06:51AM

        by sigterm (849) on Tuesday February 06, @06:51AM (#1343285)

        Really? I go to a web page I've never been to before, and it's my fault it looks weird?

      • (Score: 2, Insightful) by Anonymous Coward on Tuesday February 06, @02:43PM

        by Anonymous Coward on Tuesday February 06, @02:43PM (#1343329)

        > ten themes to choose from

        Sure -- link at the bottom of the page changes the color assignments. Does nothing to the layout that I could see, still very much 1990s style with one long page to scroll down. Sometimes I prefer this, instead of multiple shorter linked pages.

      • (Score: 2) by boltronics on Wednesday February 07, @02:20AM

        by boltronics (580) on Wednesday February 07, @02:20AM (#1343441) Homepage Journal

        Indeed. If you don't appreciate those '90s-style themes, there's also one named "The 1980s".

        --
        It's GNU/Linux dammit!
    • (Score: 5, Insightful) by turgid on Tuesday February 06, @08:08AM (1 child)

      by turgid (4318) Subscriber Badge on Tuesday February 06, @08:08AM (#1343293) Journal

      Redundancy (several), diversity (different designs, manufacturers, technologies), segregation (physically different locations, barriers). Those three will get you a long way.

      • (Score: 4, Funny) by bzipitidoo on Tuesday February 06, @01:42PM

        by bzipitidoo (4388) on Tuesday February 06, @01:42PM (#1343326) Journal

        "And" (redundancy), "But" (diversity), and "Or" (segregation), they'll get you pretty far!

    • (Score: 5, Interesting) by Crystal on Tuesday February 06, @10:40AM (6 children)

      by Crystal (28042) on Tuesday February 06, @10:40AM (#1343315)

      Hi,

      I work for Exotic Silicon, (the organisation that published the original article).

      Regarding your comment of, "the author failing to understand statistics", and the necessity of two or more faults occurring at the same time for a RAID to cause data loss, this is inaccurate.

      A failure of almost any single part of a typical RAID system can result in unpredictable behaviour.

      Compare these two configurations:

      1. A single drive.

      2. A two-drive RAID mirror, using different disks from different manufacturers.

      The possibility of either one of the drives in the RAID mirror having buggy firmware that can cause silent data corruption is statistically higher than the single drive having buggy firmware.

      Unless the RAID is configured to read from both drives for every read request and verify that the data matches, (which is uncommon), reading from any one drive with buggy firmware can cause corrupt data to be returned, (silently).

      Critically, this corrupt data will be written back out to BOTH drives as if it was good data, if the system does a write back to disk at all, and unless there is some checking of data integrity done at the application level.

      Any further questions feel free to reply or contact us directly.

      Thanks for the interest in our research!

      • (Score: 5, Insightful) by sigterm on Tuesday February 06, @11:41AM (2 children)

        by sigterm (849) on Tuesday February 06, @11:41AM (#1343319)

        Regarding your comment of, "the author failing to understand statistics", and the necessity of two or more faults occurring at the same time for a RAID to cause data loss, this is inaccurate.

        A more precise statement would be "a second or third fault would have to occur within the window of time between the first fault occurring and the array being restored to a redundant state."

        A failure of almost any single part of a typical RAID system can result in unpredictable behaviour.

        No, it can't. A drive failure will lead to the drive being ejected from the array. That's entirely predictable, and also the most common failure mode by far.

        Any other kind of hardware failure (controller, cables, memory etc) will result in the exact same symptoms as a similar hardware failure in a non-RAID setup.

        Compare these two configurations:
        1. A single drive.

        2. A two-drive RAID mirror, using different disks from different manufacturers.

        The possibility of either one of the drives in the RAID mirror having buggy firmware that can cause silent data corruption is statistically higher than the single drive having buggy firmware.

        You left out the part where buggy firmware causing silent data corruption is a scenario that:

        a) is extremely rare,

        b) is pretty much guaranteed to never occur simultaneously on two drives from different manufacturers within the lifetime of the Universe,

        c) will be detected by RAID data scrubbing, and may even be fixed if the corruption leads to ECC errors on the drive itself, and

        d) if truly silent, as in the drive returns corrupt data with no error message (and now we're pretty much into fantasy land), will affect backups every bit as much as the disk subsystem, so now you have a reliable backup of corrupted data, which is not great.

        Unless the RAID is configured to read from both drives for every read request and verify that the data matches, (which is uncommon), reading from any one drive with buggy firmware can cause corrupt data to be returned, (silently).

        As mentioned, this is such an atypical scenario that I haven't actually seen a single case of this ever happening. Also, why drag RAID into this, as the exact same problem would occur regardless of disk subsystem configuration?

        Critically, this corrupt data will be written back out to BOTH drives as if it was good data, if the system does a write back to disk at all, and unless there is some checking of data integrity done at the application level.

        You mean, in the same way that the corrupt data would be written to every offline backup without anyone noticing?

        You didn't address the obvious issue, which is that if the chance of one drive failing is X, the chance of two drives failing simultaneously, or even within a specified interval, is much smaller than X.

        RAID is no replacement for backups, and backups are no replacement for RAID. (Unless downtime isn't an issue and your time is worth nothing, I guess.)

        • (Score: 4, Interesting) by Crystal on Tuesday February 06, @07:13PM (1 child)

          by Crystal (28042) on Tuesday February 06, @07:13PM (#1343355)
          A more precise statement would be "a second or third fault would have to occur within the window of time from the first fault occurring and the array being restored to a redundant state."

          No, a single fault can easily take out a RAID. This was even more common when multi-drop SCSI cabling was the norm for RAID. A single failing drive would often down the entire SCSI bus. PSU failures can also cause all sorts of unpredictable behaviour.

          A drive failure will lead to the drive being ejected from the array.

          Nice theory. The reality is very different.

          You left out the part where buggy firmware causing silent data corruption is a scenario that: a) is extremely rare,

          The "buggy firmware" example was just the first one of many that I thought of. It's a convenient simple example of a failure scenario, but by no means the only or most likely one.

          b) is pretty much guaranteed to never occur simultaneously on two drives from different manufacturers within the lifetime of the Universe,

          It doesn't need to, that was the entire point of the comment.

          If any one of the drives in the mirror sends corrupt data unchecked over the bus to the host adaptor, you have a problem.

          Data spread over more drives = statistically more chance of this happening, (due to certain specific failure modes)

          c) will be detected by RAID data scrubbing

          No it will NOT. At least, not reliably, (unless it happens during the scrub, and never at other times.)

          and may even be fixed if the corruption leads to ECC errors on the drive itself

          In the buggy firmware scenario, it's entirely plausible that there will be no ECC error reported, because the data read from the media was correct and passed ECC, but was then corrupted by the firmware bug before being sent over the wire.

          Or the entirely wrong block was read. Good ECC, wrong data.

          d) if truly silent, as in the drive returns corrupt data with no error message (and now we're pretty much into fantasy land)

          We literally have several SSDs here which have done exactly that in the last couple of years, so it's hardly 'fantasy land'.

          Our comprehensive hashing and checksumming policies caught that silent data corruption, which would probably have gone unnoticed elsewhere.

          will affect backups every bit as much as the disk subsystem, so now you have a reliable backup of corrupted data, which is not great.

          We seem to have drifted far away from the original issue at this point. This seems like a whole different discussion.

          • (Score: 3, Interesting) by sigterm on Tuesday February 06, @11:48PM

            by sigterm (849) on Tuesday February 06, @11:48PM (#1343419)

            No, a single fault can easily take out a RAID. This was even more common when multi-drop SCSI cabling was the norm for RAID. A single failing drive would often down the entire SCSI bus.

            Was it, really? I actually never experienced that in my 30 years working with storage.

            PSU failures can also cause all sorts of unpredictable behaviour.

            Switch mode PSUs typically have one failure mode: Broken.

            A drive failure will lead to the drive being ejected from the array.

            Nice theory. The reality is very different.

            As mentioned, you're now talking to someone with somewhat extensive experience with storage, from pre-LVD SCSI to Fibre Channel and AoE/iSCSI. I haven't really experienced this "very different" reality, and it seems neither have my colleagues.

            If any one of the drives in the mirror sends corrupt data unchecked over the bus to the host adaptor, you have a problem.

            Can you mention even one scenario where this can happen, that doesn't involve the previously mentioned extremely-rare-to-the-point-of-being-purely-theoretical scenario of a peculiar firmware bug?

            Data spread over more drives = statistically more chance of this happening, (due to certain specific failure modes)

            But twice "basically never" is still "basically never."

            c) will be detected by RAID data scrubbing

            No it will NOT. At least, not reliably, (unless it happens during the scrub, and never at other times.)

            Sorry, but now you're talking plain nonsense. There is no way corrupted data being sent from a drive won't be detected by a scrubbing operation. Sure, in a RAID 1 or RAID 5 setup the controller won't be able to determine which drive is generating bad data, and as such won't be able to automatically correct the error, but it will certainly be detected.

            And since RAID 6 is pretty much the norm for large storage systems today (due in part to the time it takes to rebuild an array), the error will actually be corrected as well.

            In the buggy firmware scenario, it's entirely plausible that there will be no ECC error reported, because the data read from the media was correct and passed ECC, but was then corrupted by the firmware bug before being sent over the wire.

            And in that case, all your backups will be silently corrupted, and the error will only ever be caught if you scrub a RAID, or notice that your applications crash or return obviously erroneous data.

            d) if truly silent, as in the drive returns corrupt data with no error message (and now we're pretty much into fantasy land)

            We literally have several SSDs here which have done exactly that in the last couple of years, so it's hardly 'fantasy land'.

            Then you need to name and shame the manufacturer(s) immediately. Short of obvious scam products from China, I've never seen this "in the wild." Haven't seen it reported by Backblaze either, which I would expect given the sheer number of disks they go through every year.

            will affect backups every bit as much as the disk subsystem, so now you have a reliable backup of corrupted data, which is not great.

            We seem to have drifted far away from the original issue at this point. This seems like a whole different discussion.

            This discussion started with an article praising the virtues of offline backups (which I fully agree with), while also making unsubstantiated claims about RAID. I'm pointing out that RAID will successfully detect and even deal with the exact issues raised; issues that, if real, would render backups useless.

      • (Score: 2, Interesting) by shrewdsheep on Tuesday February 06, @11:44AM

        by shrewdsheep (5215) on Tuesday February 06, @11:44AM (#1343320)

        I second this skepticism of RAID1 solutions. I ran a RAID for a couple of years. The main disadvantages I perceived were the power consumption (both drives constantly on), the required monitoring, plus the lack of protection against anything but hardware failures. I never had to recover a full drive, but anecdotally I hear that people ran into the problem of silent corruption that you mention.

        Instead I have settled for an "rsync-RAID" once a day, including backup of modified/deleted files. This solution therefore includes protection against user error as well (once in a while these backups have to be cleared out to retain capacity, though). Additionally, the backup drive is powered down and unmounted 95% of the time, hopefully extending its life expectancy and thereby de-correlating the failure times of the two drives. I switch drives every ~5 years and have yet to experience a drive failure.
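        The core of such an "rsync-RAID" is rsync's --link-dest option: unchanged files in the new snapshot become hard links into the previous one, so each snapshot looks like a full tree but only changed files consume extra space. A minimal sketch (illustrative paths, not anyone's actual script):

        ```shell
        set -e
        SRC=/tmp/rsync_demo_src
        DST=/tmp/rsync_demo_dst
        rm -rf "$SRC" "$DST"
        mkdir -p "$SRC"
        echo "unchanged" > "$SRC/file.txt"

        # First run: plain full copy.
        rsync -a "$SRC/" "$DST/daily.1/"

        # Later run: files identical to daily.1 are hard-linked, not copied.
        rsync -a --link-dest="$DST/daily.1" "$SRC/" "$DST/daily.0/"

        # The unchanged file now shares one inode between both snapshots.
        ls -li "$DST"/daily.*/file.txt
        ```

        A real script would also rotate daily.5 -> weekly.0 and so on, and prune old trees to reclaim space.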

      • (Score: 2) by canopic jug on Tuesday February 06, @01:08PM (1 child)

        by canopic jug (3949) Subscriber Badge on Tuesday February 06, @01:08PM (#1343323) Journal

        Any further questions feel free to reply or contact us directly.

        Below, ntropia beat me to the question about other file systems [soylentnews.org] like OpenZFS or BtrFS. Those can do file-level checksums. How would they fit in with removable media?

        --
        Money is not free speech. Elections should not be auctions.
        • (Score: 2, Interesting) by Crystal on Tuesday February 06, @06:49PM

          by Crystal (28042) on Tuesday February 06, @06:49PM (#1343351)

          Below, ntropia beat me to the question about other file systems [soylentnews.org] like OpenZFS or BtrFS. Those can do file-level checksums. How would they fit in with removable media?

          For backup or archiving to removable media, the data is usually being written once and then kept unchanged until the media is wiped and re-used rather than being continuously 'in flux', with individual files being updated. So although you could use a filesystem with integrated file-level checksumming, you are trading increased complexity at the filesystem level for little gain over what you could achieve by simply doing a sha256 over the files before writing them.
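          That "simply doing a sha256 over the files" can be as small as this (a sketch using coreutils sha256sum; paths are illustrative):

          ```shell
          set -e
          DIR=/tmp/sha_demo
          rm -rf "$DIR"
          mkdir -p "$DIR"
          printf 'important data\n' > "$DIR/a.txt"
          printf 'more data\n' > "$DIR/b.txt"

          # Before archiving: record a checksum manifest alongside the data.
          ( cd "$DIR" && sha256sum a.txt b.txt > SHA256.manifest )

          # After restoring from media: verify nothing rotted in between.
          ( cd "$DIR" && sha256sum -c SHA256.manifest )
          ```

          The manifest travels with the media; verification needs nothing more exotic than coreutils on the restore machine.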

    • (Score: 3, Interesting) by boltronics on Wednesday February 07, @03:04AM

      by boltronics (580) on Wednesday February 07, @03:04AM (#1343448) Homepage Journal

      RAID can improve uptime in the event of a disk failure, but RAID is not a backup, and it can introduce problems.

      For example, imagine a simple RAID1 array, where the RAID controller or the drives are silently introducing corruption. You now have two drives with different contents. Which one is correct? Without at least three disks, it may not be possible to know. Since there were two disks, you doubled the chance of this problem occurring.

      If you delete a file on your filesystem running on RAID, it's gone for good. That's not the case with an actual backup.

      I personally use btrfs in linear/JBOD mode across two SSDs for my desktop. If an SSD fails and my computer crashes, it's not the end of the world. I get the benefits of checksums, being able to run a scrub, perform resizing, etc.

      I have my backups hosted on another machine, taken using a script I made many years ago (hosted here [github.com] if curious). It uses rsync over SSH and hard links to create a structure like so (where the number of backups kept for each interval type can be adjusted as needed):


      root@zombie:/var/local/backups/pi4b-0.internal.systemsaviour.com/root# ls -l
      total 64
      drwxr-xr-x 19 root root 4096 Jan 24 09:56 daily.0
      drwxr-xr-x 19 root root 4096 Jan 24 09:56 daily.1
      drwxr-xr-x 19 root root 4096 Jan 24 09:56 daily.2
      drwxr-xr-x 19 root root 4096 Jan 24 09:56 daily.3
      drwxr-xr-x 19 root root 4096 Jan 24 09:56 daily.4
      drwxr-xr-x 19 root root 4096 Jan 24 09:56 daily.5
      drwxr-xr-x 19 root root 4096 May 30 2023 monthly.0
      drwxr-xr-x 19 root root 4096 May 30 2023 monthly.1
      drwxr-xr-x 19 root root 4096 May 30 2023 monthly.2
      drwxr-xr-x 19 root root 4096 May 30 2023 monthly.3
      drwxr-xr-x 19 root root 4096 May 30 2023 monthly.4
      drwxr-xr-x 19 root root 4096 May 30 2023 monthly.5
      drwxr-xr-x 19 root root 4096 Jan 24 09:56 weekly.0
      drwxr-xr-x 19 root root 4096 May 30 2023 weekly.1
      drwxr-xr-x 19 root root 4096 May 30 2023 weekly.2
      drwxr-xr-x 19 root root 4096 May 30 2023 yearly.0
      root@zombie:/var/local/backups/pi4b-0.internal.systemsaviour.com/root# ls -l yearly.0/
      total 68
      lrwxrwxrwx 16 root root 7 Apr 5 2022 bin -> usr/bin
      drwxr-xr-x 4 root root 4096 Nov 15 11:28 boot
      drwxr-xr-x 3 root root 4096 Jan 1 1970 boot.bak
      drwxr-xr-x 2 root root 4096 Dec 19 20:06 dev
      drwxr-xr-x 93 root root 4096 Dec 31 20:20 etc
      drwxr-xr-x 3 root root 4096 Jun 15 2022 home
      lrwxrwxrwx 16 root root 7 Apr 5 2022 lib -> usr/lib
      drwx------ 2 root root 4096 Apr 5 2022 lost+found
      drwxr-xr-x 2 root root 4096 Apr 5 2022 media
      drwxr-xr-x 2 root root 4096 Apr 5 2022 mnt
      drwxr-xr-x 3 root root 4096 May 30 2023 opt
      dr-xr-xr-x 2 root root 4096 Jan 1 1970 proc
      drwx------ 4 root root 4096 Jan 1 00:14 root
      drwxr-xr-x 25 root root 4096 Jan 1 00:23 run
      lrwxrwxrwx 16 root root 8 Apr 5 2022 sbin -> usr/sbin
      drwxr-xr-x 2 root root 4096 Apr 5 2022 srv
      dr-xr-xr-x 2 root root 4096 Jan 1 1970 sys
      drwxrwxrwt 2 root root 4096 Jan 1 00:14 tmp
      drwxr-xr-x 11 root root 4096 Apr 5 2022 usr
      drwxr-xr-x 11 root root 4096 Apr 5 2022 var
      root@zombie:/var/local/backups/pi4b-0.internal.systemsaviour.com/root#

      This makes it very easy to recover a single file, or all files on an entire filesystem (where I take the stance that anything below the filesystem level can always be recreated easily enough). I actually had to use it to recover from a failed SD card in a Raspberry Pi the other day (which I use to host my website). It also provides for the possibility to check for filesystem corruption by identifying which files are not hard links between any two snapshots (and check for any unexpected differences).
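      That corruption check boils down to comparing inode numbers: a file that is still a hard link into the previous snapshot is byte-identical, while one with its own inode changed (or rotted) between runs. A sketch, assuming GNU stat and illustrative paths:

      ```shell
      set -e
      OLD=/tmp/snap_demo/daily.1
      NEW=/tmp/snap_demo/daily.0
      rm -rf /tmp/snap_demo
      mkdir -p "$OLD" "$NEW"
      echo same > "$OLD/kept.txt"
      ln "$OLD/kept.txt" "$NEW/kept.txt"   # unchanged file: shared inode
      echo v1 > "$OLD/edited.txt"
      echo v2 > "$NEW/edited.txt"          # changed file: its own inode

      # Report files present in both snapshots that are not hard links.
      for f in "$NEW"/*; do
          name=$(basename "$f")
          if [ -e "$OLD/$name" ] && \
             [ "$(stat -c %i "$f")" != "$(stat -c %i "$OLD/$name")" ]; then
              echo "differs: $name"
          fi
      done
      ```

      Anything it reports that you didn't knowingly modify is worth a closer look.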

      --
      It's GNU/Linux dammit!
  • (Score: 3, Funny) by Anonymous Coward on Tuesday February 06, @07:38AM

    by Anonymous Coward on Tuesday February 06, @07:38AM (#1343287)

    It’s ironic that TFA is about maintaining access to data, but blocks access from VPNs (at least mine) and is blacklisted on web.archive.org, so also no backup possible of the precious information they’re sharing.

  • (Score: 2) by ledow on Tuesday February 06, @08:18AM (2 children)

    by ledow (5567) on Tuesday February 06, @08:18AM (#1343297) Homepage

    "Aside from the fact that when modern SSDs fail they often remain readable"

    But do they? I'm not sure that's true at all.

    The only way to back up is to keep your data on as many media in as many locations as practical, and verify it. That means using all those technologies, in several places, as well as older ones, WORM, cloud, etc. etc. etc.

    It's literally the only way to guarantee anything in any significant amount. Everything else is at risk of "backup monoculture", where you put all your eggs in the tape/optical/RAID/whatever basket and then realise that technology has a problem that others don't (e.g. storage temperature/humidity sensitivity, etc.).

    And the only way to keep that going for any significant length of time is to keep copying your data and moving it to new places and technologies and verifying it. The backup you used 20 years ago SHOULD NOT be your backup now - that drive/tape/tech is 20 years old! That's ancient in IT terms and you'll have problems sourcing parts, replacements, drivers, compatible machines, etc.

    • (Score: 4, Interesting) by janrinok on Tuesday February 06, @08:23AM

      by janrinok (52) Subscriber Badge on Tuesday February 06, @08:23AM (#1343299) Journal

      I would agree with the statement as it is written but you are correct to point out that relying on such a property is not a good plan for long term security!

      I have only had one SSD fail (so far!) but I did manage to extract some of the data from it. However, I made a mistake and became overconfident: I thought I would see if I could write to it again, my logic being to see if the SSD could be used in part even if not completely. That screwed up the remaining data that I had still not recovered. I got the essential bit that I needed, but lost some data that I would have liked to keep, if only out of interest.

      --
      I am not interested in knowing who people are or where they live. My interest starts and stops at our servers.
    • (Score: 4, Interesting) by sigterm on Tuesday February 06, @09:14AM

      by sigterm (849) on Tuesday February 06, @09:14AM (#1343308)

      "Aside from the fact that when modern SSDs fail they often remain readable"

      But do they? I'm not sure that's true at all.

      And you would be correct in your assumption.

      When a failing SSD goes read-only, it means the controller and interface are both still fully functional, but the number of damaged cells in the flash memory chip(s) has reached the threshold that the firmware considers unacceptable. It's similar to a S.M.A.R.T. failure reported by a conventional HDD; there aren't enough free sectors/cells left to handle the growing number of defects.

      This failure mode is the best-case scenario, where the flash cells slowly succumb to wear and tear. In these scenarios you often do have ample time to back up your data, since flash cell failures are usually detected when a block is erased and cells are re-written. When the SSD controller detects a failed cell, it still has the original data cached and can redirect the write operation to an unused cell without data loss.

      However, in the case of catastrophic failure of either the SSD controller or the flash chip, immediate and total data loss can occur without any warning. I've seen this more times than I'm comfortable with, which is why I'm skeptical of using single SSDs in any configuration.

  • (Score: 3, Touché) by driverless on Tuesday February 06, @11:11AM (2 children)

    by driverless (4770) on Tuesday February 06, @11:11AM (#1343317)

    I assume it's from 1992, based on the linked web page. I thought that sort of random layout and angry-fruit-salad ransom-note text died last century.

    • (Score: 2) by Freeman on Tuesday February 06, @03:10PM (1 child)

      by Freeman (732) on Tuesday February 06, @03:10PM (#1343330) Journal

      I most often attribute that design to "someone knowing the best way to do it", like with conspiracy theory sites. If it looks like a cult, it probably is.

      --
      Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
      • (Score: 2) by driverless on Wednesday February 07, @02:52AM

        by driverless (4770) on Wednesday February 07, @02:52AM (#1343446)

        Now that you mention it, yeah, that was more or less my reaction too.

  • (Score: 4, Insightful) by ntropia on Tuesday February 06, @01:00PM (3 children)

    by ntropia (2577) on Tuesday February 06, @01:00PM (#1343322)

    I really don't like the take on RAID systems, the probability estimates are arguably inaccurate.

    However, I'm surprised ZFS, or any other copy-on-write filesystem, isn't even mentioned.
    ZFS manages snapshots, bit-rot errors, and a plethora of RAID configurations.
    That alone covers all operator errors, and together with a good backup system, it covers pretty much every scenario worth considering.

    • (Score: 0) by Anonymous Coward on Tuesday February 06, @05:20PM (1 child)

      by Anonymous Coward on Tuesday February 06, @05:20PM (#1343345)

      However, I'm surprised ZFS, or any other copy-on-write filesystem, isn't even mentioned.
      ZFS manages snapshots, bit-rot errors, and a plethora of RAID configurations.
      That alone covers all operator errors, and together with a good backup system, it covers pretty much every scenario worth considering.

      Automatic snapshots work very well to deal with "oh shit, I accidentally deleted that data I needed". You don't need fancy filesystems for this, you can use LVM with a traditional filesystem to achieve the same thing.
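      For the record, the LVM route looks roughly like this (a sketch; the volume group "vg0", LV "data" and file names are made up, and it needs root):

      ```shell
      # Take a copy-on-write snapshot of a logical volume holding an
      # ordinary filesystem (ext4, XFS, ...).
      lvcreate --size 5G --snapshot --name data_snap /dev/vg0/data

      # Recover an accidentally deleted file by mounting the snapshot read-only.
      mount -o ro /dev/vg0/data_snap /mnt/snap
      cp /mnt/snap/home/user/important.doc /home/user/

      # Drop the snapshot when done; its copy-on-write space is finite and
      # the snapshot dies if it fills up.
      umount /mnt/snap
      lvremove /dev/vg0/data_snap
      ```

      A cron job creating and rotating such snapshots gets you the "oh shit, I deleted it" protection without a copy-on-write filesystem.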

      Automatic snapshots don't do well to deal with "oh shit, I accidentally ran a job that filled the entire disk", as unless you notice the problem before the next snapshot now you need some administrator intervention to delete data from the snapshots to free up disk space to make it usable again.

      Both scenarios happen with alarming regularity, but I think on balance automatic snapshots are worth the trouble.

      My personal experience with btrfs is not good. Linux 5.2 had a bug that would destroy an entire btrfs filesystem if you hit it (which I did. Not to say that other filesystems are bug-free, but btrfs is the only one I've seen destroyed so thoroughly), and (more importantly) the filesystem seems to require basically continuous manual intervention to maintain acceptable performance.

      Only on btrfs have I run into scenarios where df tells you there's gigabytes of free space available but then when you try to create a new file it fails with -ENOSPC, so you try to delete a file and that also fails with -ENOSPC, so hopefully you have root access to manually run some arcane balance command to make the system work again. Btrfs people will tell you that this behaviour of df is eminently reasonable and they will tell you the question of "how much disk space is available?" is inherently ill-specified which probably makes perfect sense to btrfs people but the behaviour is totally incomprehensible to normal people who do not dedicate their lives to researching copy-on-write filesystems.
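      For reference, the "arcane balance command" in question is roughly this (mount point illustrative, needs root):

      ```shell
      # Show how much space btrfs has allocated to block groups vs. how much
      # is actually used inside them; the gap explains the df/ENOSPC mismatch.
      btrfs filesystem usage /mnt

      # Rewrite data block groups that are at most 50% full, returning the
      # reclaimed chunks to the unallocated pool so new writes can succeed.
      btrfs balance start -dusage=50 /mnt
      ```

      The -dusage filter keeps the balance from rewriting the whole filesystem, which is why it's the usual first-aid suggestion.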

      I don't have enough personal experience with ZFS to know if it fares significantly better. But if I need copy-on-write snapshots on Linux I'll take LVM+XFS over btrfs any day of the week.

      • (Score: 2) by owl on Wednesday February 07, @03:51AM

        by owl (15206) on Wednesday February 07, @03:51AM (#1343453)

        After btrfs ruined an entire array of mine because one disk started doing normal "disk failure" things (random read errors), a failure mode that a Linux device driver could simulate before btrfs even existed (so it could have been tested against), I decided that the btrfs developers were not to be trusted to produce a sensible filesystem. I will never again run btrfs on any disk. It is on my short list of banned filesystems (hint: the list is one item long).

    • (Score: 2) by hendrikboom on Tuesday February 06, @10:47PM

      by hendrikboom (1125) Subscriber Badge on Tuesday February 06, @10:47PM (#1343404) Homepage Journal

      And the new bcachefs. I wonder how long until there's enough experience for me to trust it.

  • (Score: 2) by Whoever on Tuesday February 06, @03:59PM (5 children)

    by Whoever (4524) on Tuesday February 06, @03:59PM (#1343334) Journal

    The article promotes the idea of keeping tape archives of backups, but fails to discuss the problem of actually finding a working tape drive that will read the tape when it becomes necessary to read the archive. Ever-increasing amounts of data mean that a 30-year-old drive won't still be in current use for making new backups.

    • (Score: 2) by owl on Tuesday February 06, @04:35PM

      by owl (15206) on Tuesday February 06, @04:35PM (#1343341)

      Indeed, and even modern LTO tape drives won't help you there, as they are specified to read only a single older generation of LTO tape. So one has to buy a "new tape" drive before the "old tape" drive fails completely, and then migrate all the data on the old tapes to tapes compatible with the new drive, in order to keep all those "archives" readable.

      The setup is clearly aimed at the corporate market, which "upgrades" on a time cycle (say, three years) rather than a "hardware has failed" cycle, and which pays people to migrate data forward from old formats (or has a tape robot that handles the migration automatically).

    • (Score: 2) by hendrikboom on Tuesday February 06, @10:54PM (3 children)

      by hendrikboom (1125) Subscriber Badge on Tuesday February 06, @10:54PM (#1343405) Homepage Journal

      I had a duplicate of a tape drive stored in a friend's sock drawer on another continent.
      When my copy turned out to be corrupt I called my friend.
      He contacted an acquaintance of his in IBM Netherlands, who was quite happy to take it on as a challenge. Turns out he was working in long-term archiving and data recovery, had tape drives that were compatible with ones from oh ages ago, and was delighted to have a tape from 1979 to work on. This was about 30 years later.
      He could read the tape, and its contents were emailed to me as an attachment. Times have changed.

      • (Score: 2) by Whoever on Tuesday February 06, @11:05PM (2 children)

        by Whoever (4524) on Tuesday February 06, @11:05PM (#1343407) Journal

        Your anecdote is very interesting, but few people have a friend working in long-term archival. You were a special case and lucky.

        • (Score: 2) by sigterm on Wednesday February 07, @01:59AM (1 child)

          by sigterm (849) on Wednesday February 07, @01:59AM (#1343433)

          Any company offering professional data recovery services will have a selection of old but working tape drives, and will be able to recover data from pretty much every tape format that has ever existed.

          But you're making a good point: A long-term archival strategy absolutely should include a device capable of reading any relevant media, possibly along with the necessary controller and even a complete computer. I expect that in just a few years, you'll be hard-pressed to find a PCIe SCSI controller, and motherboards with PCI, EISA, or ISA slots can only be sourced in the "retro" section on eBay.

          • (Score: 0) by Anonymous Coward on Thursday February 08, @07:27PM

            by Anonymous Coward on Thursday February 08, @07:27PM (#1343651)

            Any company offering professional data recovery services will have a selection of old but working tape drives, and will be able to recover data from pretty much every tape format that has ever existed.

      Well, it depends on how many old tapes you have!

            The BBC migrated their entire analogue video archive to D-3 tape in the early 1990s, and now has the better part of half a million D-3 tapes in their archive. To date, less than half of this archive has been converted to a newer format. It is doubtful that all the remaining working D-3 tape heads in the world can survive long enough to read the entire remaining archive.

            Fun times ahead!
