
posted by janrinok on Tuesday February 06 2024, @03:51AM
from the confidentiality-integrity-and-availability dept.

Exotic Silicon has a detailed exploration of how and why to make long term backups.

The myth...

When thinking about data backup, many people have tended to fixate on the possibility of a crashed hard disk, and in modern times, a totally dead SSD. It's been the classic disaster scenario for decades, assuming that your office doesn't burn down overnight. You sit down in front of your desktop in the morning, and it won't boot. As you reach in to fiddle with SATA cables and clean connections, you realise that the disk isn't even spinning up.

Maybe you knew enough to try a couple of short, sharp, ninety-degree twists in the plane of the platters, in case it was caused by stiction. But sooner or later, reality dawns, and it becomes clear that the disk will never spin again. It, along with your data, is gone forever. So a couple of full backups at regular intervals should suffice, right?

Except that isn't how it usually happens - most likely you'll be calling on your backups for some other reason.

The reality...

Aside from the fact that when modern SSDs fail they often remain readable, i.e. they become read-only, your data is much more likely to be at risk from silent corruption over time or from being overwritten due to operator error.

Silent corruption can happen for reasons ranging from bad SATA cables and buggy SSD firmware, to malware and more. Operator error might go genuinely unnoticed, or be covered up.

Both of these scenarios can be protected against with an adequate backup strategy, but the simple approach of a regular full backup (which also often goes untested) in many cases just won't suffice.

Aspects like the time interval between backups, how many copies to have and how long to keep them, speed of recovery, and the confidentiality and integrity of said backups are all addressed. Also covered are silent corruption, archiving unchanging data, examples of comprehensive backup plans, and how to correctly store, label, and handle the backup storage media.

Not all storage media have long life spans.


Original Submission

 
This discussion was created by janrinok (52) for logged-in users only, but now has been archived. No new comments can be posted.
  • (Score: 5, Interesting) by sigterm on Tuesday February 06 2024, @05:11AM (14 children)

    by sigterm (849) on Tuesday February 06 2024, @05:11AM (#1343276)

    When thinking about data backup, many people have tended to fixate on the possibility of a crashed hard disk, and in modern times, a totally dead SSD.

    This is very true, and while inevitable hardware failure should certainly be taken into account when designing a backup strategy, it's not necessarily the most pressing issue.

    All drives will eventually fail, but you can address that with redundancy as well as backups. In fact, redundancy (RAID) is the far better solution, as not only does it create an always-current backup of your system, it also handles hardware failure without downtime. It can also detect and repair silent corruption due to so-called "bit rot" by regularly scrubbing the RAID set.
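
    For reference, scrubbing on Linux software RAID (md) can be triggered by hand or from cron; a minimal sketch, assuming an array at /dev/md0 (the device name is an assumption, not something from the article or this comment):

    # assumed example: scrub a Linux md array named md0
    echo check > /sys/block/md0/md/sync_action    # start a read-and-compare pass
    cat /proc/mdstat                              # watch progress
    cat /sys/block/md0/md/mismatch_cnt            # non-zero after the check means inconsistent blocks were found
    echo repair > /sys/block/md0/md/sync_action   # rewrite inconsistencies from the remaining copy/parity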

    (The article tries to make the argument that RAID can make the situation worse by introducing "new failure modes," but this is just the author failing to understand statistics. Yes, the chances of a fault occurring are doubled when you have two drives instead of one, but the point is that with RAID, it takes two or more faults occurring at the same time to cause data loss. Decades of experience have shown RAID to provide excellent protection from hardware failure.)

    As the author points out, what RAID doesn't protect you from is operator/user error or faulty software/firmware. I'd also add malware/ransomware to that list. Offline backups are simply a must, and they have to go back multiple generations, as many forms of data corruption aren't necessarily detected immediately.

    It's a shame the web site looks like a Geocities page from the 1990s, but the article is well worth reading regardless.

  • (Score: 3, Funny) by Anonymous Coward on Tuesday February 06 2024, @06:33AM (3 children)

    by Anonymous Coward on Tuesday February 06 2024, @06:33AM (#1343284)

    > It's a shame the web site looks like a Geocities page from the 1990s, but the article is well worth reading regardless.

    That's on you. There are ten themes to choose from there, available via a click and without javascript.

    • (Score: 4, Touché) by sigterm on Tuesday February 06 2024, @06:51AM

      by sigterm (849) on Tuesday February 06 2024, @06:51AM (#1343285)

      Really? I go to a web page I've never been to before, and it's my fault it looks weird?

    • (Score: 2, Insightful) by Anonymous Coward on Tuesday February 06 2024, @02:43PM

      by Anonymous Coward on Tuesday February 06 2024, @02:43PM (#1343329)

      > ten themes to choose from

      Sure -- link at the bottom of the page changes the color assignments. Does nothing to the layout that I could see, still very much 1990s style with one long page to scroll down. Sometimes I prefer this, instead of multiple shorter linked pages.

    • (Score: 2) by boltronics on Wednesday February 07 2024, @02:20AM

      by boltronics (580) on Wednesday February 07 2024, @02:20AM (#1343441) Homepage Journal

      Indeed. If you don't appreciate those 90's style themes, there's also one named "The 1980s".

      --
      It's GNU/Linux dammit!
  • (Score: 5, Insightful) by turgid on Tuesday February 06 2024, @08:08AM (1 child)

    by turgid (4318) Subscriber Badge on Tuesday February 06 2024, @08:08AM (#1343293) Journal

    Redundancy (several), diversity (different designs, manufacturers, technologies), segregation (physically different locations, barriers). Those three will get you a long way.

    • (Score: 4, Funny) by bzipitidoo on Tuesday February 06 2024, @01:42PM

      by bzipitidoo (4388) on Tuesday February 06 2024, @01:42PM (#1343326) Journal

      "And" (redundancy), "But" (diversity), and "Or" (segregation), they'll get you pretty far!

  • (Score: 5, Interesting) by Crystal on Tuesday February 06 2024, @10:40AM (6 children)

    by Crystal (28042) on Tuesday February 06 2024, @10:40AM (#1343315)

    Hi,

    I work for Exotic Silicon, (the organisation that published the original article).

    Regarding your comment about "the author failing to understand statistics", and the necessity of two or more faults occurring at the same time for a RAID to cause data loss, this is inaccurate.

    A failure of almost any single part of a typical RAID system can result in unpredictable behaviour.

    Compare these two configurations:

    1. A single drive.

    2. A two-drive RAID mirror, using different disks from different manufacturers.

    The possibility of either one of the drives in the RAID mirror having buggy firmware that can cause silent data corruption is statistically higher than the single drive having buggy firmware.

    Unless the RAID is configured to read from both drives for every read request and verify that the data matches, (which is uncommon), reading from any one drive with buggy firmware can cause corrupt data to be returned, (silently).

    Critically, this corrupt data will be written back out to BOTH drives as if it was good data, if the system does a write back to disk at all, and unless there is some checking of data integrity done at the application level.

    If you have any further questions, feel free to reply or contact us directly.

    Thanks for the interest in our research!

    • (Score: 5, Insightful) by sigterm on Tuesday February 06 2024, @11:41AM (2 children)

      by sigterm (849) on Tuesday February 06 2024, @11:41AM (#1343319)

      Regarding your comment about "the author failing to understand statistics", and the necessity of two or more faults occurring at the same time for a RAID to cause data loss, this is inaccurate.

      A more precise statement would be "a second or third fault would have to occur within the window of time between the first fault occurring and the array being restored to a redundant state."

      A failure of almost any single part of a typical RAID system can result in unpredictable behaviour.

      No, it can't. A drive failure will lead to the drive being ejected from the array. That's entirely predictable, and also the most common failure mode by far.

      Any other kind of hardware failure (controller, cables, memory etc) will result in the exact same symptoms as a similar hardware failure in a non-RAID setup.

      Compare these two configurations:
      1. A single drive.

      2. A two-drive RAID mirror, using different disks from different manufacturers.

      The possibility of either one of the drives in the RAID mirror having buggy firmware that can cause silent data corruption is statistically higher than the single drive having buggy firmware.

      You left out the part where a buggy firmware causing silent data corruption is a scenario that:

      a) is extremely rare,

      b) is pretty much guaranteed to never occur simultaneously on two drives from different manufacturers within the lifetime of the Universe,

      c) will be detected by RAID data scrubbing, and may even be fixed if the corruption leads to ECC errors on the drive itself, and

      d) if truly silent, as in the drive returns corrupt data with no error message (and now we're pretty much into fantasy land), will affect backups every bit as much as the disk subsystem, so now you have a reliable backup of corrupted data, which is not great.

      Unless the RAID is configured to read from both drives for every read request and verify that the data matches, (which is uncommon), reading from any one drive with buggy firmware can cause corrupt data to be returned, (silently).

      As mentioned, this is such an atypical scenario that I haven't actually seen a single case of this ever happening. Also, why drag RAID into this, as the exact same problem would occur regardless of disk subsystem configuration?

      Critically, this corrupt data will be written back out to BOTH drives as if it was good data, if the system does a write back to disk at all, and unless there is some checking of data integrity done at the application level.

      You mean, in the same way that the corrupt data would be written to every offline backup without anyone noticing?

      You didn't address the obvious issue, which is that if the chance of one drive failing is X, the chance of two drives failing simultaneously, or even within a specified interval, is much smaller than X.

      RAID is no replacement for backups, and backups are no replacement for RAID. (Unless downtime isn't an issue and your time is worth nothing, I guess.)

      • (Score: 4, Interesting) by Crystal on Tuesday February 06 2024, @07:13PM (1 child)

        by Crystal (28042) on Tuesday February 06 2024, @07:13PM (#1343355)
        A more precise statement would be "a second or third fault would have to occur within the window of time between the first fault occurring and the array being restored to a redundant state."

        No, a single fault can easily take out a RAID. This was even more common when multi-drop SCSI cabling was the norm for RAID. A single failing drive would often down the entire SCSI bus. PSU failures can also cause all sorts of unpredictable behaviour.

        A drive failure will lead to the drive being ejected from the array.

        Nice theory. The reality is very different.

        You left out the part where a buggy firmware causing silent data corruption is a scenario that: a) is extremely rare,

        The "buggy firmware" example was just the first one of many that I thought of. It's a convenient simple example of a failure scenario, but by no means the only or most likely one.

        b) is pretty much guaranteed to never occur simultaneously on two drives from different manufacturers within the lifetime of the Universe,

        It doesn't need to, that was the entire point of the comment.

        If any one of the drives in the mirror sends corrupt data unchecked over the bus to the host adaptor, you have a problem.

        Data spread over more drives = statistically more chance of this happening (due to certain specific failure modes).

        c) will be detected by RAID data scrubbing

        No it will NOT. At least, not reliably, (unless it happens during the scrub, and never at other times.)

        and may even be fixed if the corruption leads to ECC errors on the drive itself

        In the buggy firmware scenario, it's entirely plausible that there will be no ECC error reported, because the data read from the media was correct and passed ECC, but was then corrupted by the firmware bug before being sent over the wire.

        Or the entirely wrong block was read. Good ECC, wrong data.

        d) if truly silent, as in the drive returns corrupt data with no error message (and now we're pretty much into fantasy land)

        We literally have several SSDs here which have done exactly that in the last couple of years, so it's hardly 'fantasy land'.

        Our comprehensive hashing and checksumming policies caught that silent data corruption, which would probably have gone unnoticed elsewhere.

        will affect backups every bit as much as the disk subsystem, so now you have a reliable backup of corrupted data, which is not great.

        We seem to have drifted far away from the original issue at this point. This seems like a whole different discussion.

        • (Score: 3, Interesting) by sigterm on Tuesday February 06 2024, @11:48PM

          by sigterm (849) on Tuesday February 06 2024, @11:48PM (#1343419)

          No, a single fault can easily take out a RAID. This was even more common when multi-drop SCSI cabling was the norm for RAID. A single failing drive would often down the entire SCSI bus.

          Was it, really? I actually never experienced that in my 30 years working with storage.

          PSU failures can also cause all sorts of unpredictable behaviour.

          Switch mode PSUs typically have one failure mode: Broken.

          A drive failure will lead to the drive being ejected from the array.

          Nice theory. The reality is very different.

          As mentioned, you're now talking to someone with somewhat extensive experience with storage, from pre-LVD SCSI to Fibre Channel and AoE/iSCSI. I haven't really experienced this "very different" reality, and it seems neither have my colleagues.

          If any one of the drives in the mirror sends corrupt data unchecked over the bus to the host adaptor, you have a problem.

          Can you mention even one scenario where this can happen, that doesn't involve the previously mentioned extremely-rare-to-the-point-of-being-purely-theoretical scenario of a peculiar firmware bug?

          Data spread over more drives = statistically more chance of this happening (due to certain specific failure modes).

          But twice "basically never" is still "basically never."

          c) will be detected by RAID data scrubbing

          No it will NOT. At least, not reliably, (unless it happens during the scrub, and never at other times.)

          Sorry, but now you're talking plain nonsense. There is no way corrupted data being sent from a drive won't be detected by a scrubbing operation. Sure, in a RAID 1 or RAID 5 setup the controller won't be able to determine which drive is generating bad data, and as such won't be able to automatically correct the error, but it will certainly be detected.

          And since RAID 6 is pretty much the norm for large storage systems today (due in part to the time it takes to rebuild an array), the error will actually be corrected as well.

          In the buggy firmware scenario, it's entirely plausible that there will be no ECC error reported, because the data read from the media was correct and passed ECC, but was then corrupted by the firmware bug before being sent over the wire.

          And in that case, all your backups will be silently corrupted, and the error will only ever be caught if you scrub a RAID, or notice that your applications crash or return obviously erroneous data.

          d) if truly silent, as in the drive returns corrupt data with no error message (and now we're pretty much into fantasy land)

          We literally have several SSDs here which have done exactly that in the last couple of years, so it's hardly 'fantasy land'.

          Then you need to name and shame the manufacturer(s) immediately. Short of obvious scam products from China, I've never seen this "in the wild." Haven't seen it reported by Backblaze either, which I would expect given the sheer number of disks they go through every year.

          will affect backups every bit as much as the disk subsystem, so now you have a reliable backup of corrupted data, which is not great.

          We seem to have drifted far away from the original issue at this point. This seems like a whole different discussion.

          This discussion started with an article praising the virtues of offline backups (which I fully agree with), while also making unsubstantiated claims about RAID. I'm pointing out that RAID will successfully detect and even deal with the exact issues raised; issues that, if real, would render backups useless.

    • (Score: 2, Interesting) by shrewdsheep on Tuesday February 06 2024, @11:44AM

      by shrewdsheep (5215) on Tuesday February 06 2024, @11:44AM (#1343320)

      I second this skepticism of RAID1 solutions. I ran a RAID for a couple of years. The main disadvantages I perceived were the power consumption (both drives constantly on), the required monitoring, plus the lack of protection against anything but hardware failures. I never had to recover a full drive, but anecdotally I hear that people ran into the problem of silent corruption that you mention.

      Instead I have settled for an "rsync-RAID" once a day, including backup of modified/deleted files. This solution therefore includes protection against user error as well (once in a while these backups have to be cleared out to retain capacity, though). Additionally, the backup drive is powered down and unmounted for 95% of the time, hopefully extending its life expectancy and thereby de-correlating the failure times of the two drives. I switch drives every ~5 years and have yet to experience a drive failure.
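
      A minimal sketch of what such a daily run might look like, assuming the data lives in /home and the backup drive mounts at /mnt/backup (the paths and layout are assumptions, not the poster's actual setup):

      #!/bin/sh
      # assumed example: daily rsync mirror, plus a dated folder holding copies
      # of any files that were modified or deleted since the previous run
      DATE=$(date +%Y-%m-%d)
      mount /mnt/backup
      rsync -a --delete \
          --backup --backup-dir=/mnt/backup/changed/"$DATE" \
          /home/ /mnt/backup/current/
      umount /mnt/backup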

    • (Score: 2) by canopic jug on Tuesday February 06 2024, @01:08PM (1 child)

      by canopic jug (3949) Subscriber Badge on Tuesday February 06 2024, @01:08PM (#1343323) Journal

      Any further questions feel free to reply or contact us directly.

      Below, ntropia beat me to the question about other file systems [soylentnews.org] like OpenZFS or BtrFS. Those can do file-level checksums. How would they fit in with removable media?

      --
      Money is not free speech. Elections should not be auctions.
      • (Score: 2, Interesting) by Crystal on Tuesday February 06 2024, @06:49PM

        by Crystal (28042) on Tuesday February 06 2024, @06:49PM (#1343351)

        Below, ntropia beat me to the question about other file systems [soylentnews.org] like OpenZFS or BtrFS. Those can do file-level checksums. How would they fit in with removable media?

        For backup or archiving to removable media, the data is usually being written once and then kept unchanged until the media is wiped and re-used rather than being continuously 'in flux', with individual files being updated. So although you could use a filesystem with integrated file-level checksumming, you are trading increased complexity at the filesystem level for little gain over what you could achieve by simply doing a sha256 over the files before writing them.
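
        As a concrete illustration of that simpler approach (a sketch only; the directory names are made up), something like this before and after writing the media would do:

        # assumed example: create a checksum manifest before writing the archive set
        cd /data/archive-2024
        find . -type f ! -name SHA256SUMS -print0 | xargs -0 sha256sum > SHA256SUMS
        # ...copy the directory, including SHA256SUMS, to the removable media...
        # later, verify the copy by reading it back from the media:
        cd /mnt/removable/archive-2024
        sha256sum -c SHA256SUMS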

  • (Score: 3, Interesting) by boltronics on Wednesday February 07 2024, @03:04AM

    by boltronics (580) on Wednesday February 07 2024, @03:04AM (#1343448) Homepage Journal

    RAID can improve uptime in the event of a disk failure, but RAID is not a backup, and it can introduce problems.

    For example, imagine a simple RAID1 array, where the RAID controller or the drives are silently introducing corruption. You now have two drives with different contents. Which one is correct? Without at least three disks, it may not be possible to know. Since there were two disks, you doubled the chance of this problem occurring.

    If you delete a file on your filesystem running on RAID, it's gone for good. That's not the case with an actual backup.

    I personally use btrfs in linear/JBOD mode across two SSDs for my desktop. If an SSD fails and my computer crashes, it's not the end of the world. I get the benefits of checksums, being able to run a scrub, perform resizing, etc.
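
    For what it's worth, those maintenance tasks map onto standard btrfs commands; a small sketch, assuming the filesystem is mounted at /mnt/data (the mount point is an assumption):

    # assumed example: routine btrfs maintenance
    btrfs scrub start /mnt/data              # verify data and metadata checksums
    btrfs scrub status /mnt/data             # check progress and any errors found
    btrfs filesystem resize -50g /mnt/data   # shrink the filesystem by 50 GiB online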

    I have my backups hosted on another machine, taken using a script I made many years ago (hosted here [github.com] if curious). It uses rsync over SSH and hard links to create a structure like so (where the number of backups kept for each interval type can be adjusted as needed):


    root@zombie:/var/local/backups/pi4b-0.internal.systemsaviour.com/root# ls -l
    total 64
    drwxr-xr-x 19 root root 4096 Jan 24 09:56 daily.0
    drwxr-xr-x 19 root root 4096 Jan 24 09:56 daily.1
    drwxr-xr-x 19 root root 4096 Jan 24 09:56 daily.2
    drwxr-xr-x 19 root root 4096 Jan 24 09:56 daily.3
    drwxr-xr-x 19 root root 4096 Jan 24 09:56 daily.4
    drwxr-xr-x 19 root root 4096 Jan 24 09:56 daily.5
    drwxr-xr-x 19 root root 4096 May 30 2023 monthly.0
    drwxr-xr-x 19 root root 4096 May 30 2023 monthly.1
    drwxr-xr-x 19 root root 4096 May 30 2023 monthly.2
    drwxr-xr-x 19 root root 4096 May 30 2023 monthly.3
    drwxr-xr-x 19 root root 4096 May 30 2023 monthly.4
    drwxr-xr-x 19 root root 4096 May 30 2023 monthly.5
    drwxr-xr-x 19 root root 4096 Jan 24 09:56 weekly.0
    drwxr-xr-x 19 root root 4096 May 30 2023 weekly.1
    drwxr-xr-x 19 root root 4096 May 30 2023 weekly.2
    drwxr-xr-x 19 root root 4096 May 30 2023 yearly.0
    root@zombie:/var/local/backups/pi4b-0.internal.systemsaviour.com/root# ls -l yearly.0/
    total 68
    lrwxrwxrwx 16 root root 7 Apr 5 2022 bin -> usr/bin
    drwxr-xr-x 4 root root 4096 Nov 15 11:28 boot
    drwxr-xr-x 3 root root 4096 Jan 1 1970 boot.bak
    drwxr-xr-x 2 root root 4096 Dec 19 20:06 dev
    drwxr-xr-x 93 root root 4096 Dec 31 20:20 etc
    drwxr-xr-x 3 root root 4096 Jun 15 2022 home
    lrwxrwxrwx 16 root root 7 Apr 5 2022 lib -> usr/lib
    drwx------ 2 root root 4096 Apr 5 2022 lost+found
    drwxr-xr-x 2 root root 4096 Apr 5 2022 media
    drwxr-xr-x 2 root root 4096 Apr 5 2022 mnt
    drwxr-xr-x 3 root root 4096 May 30 2023 opt
    dr-xr-xr-x 2 root root 4096 Jan 1 1970 proc
    drwx------ 4 root root 4096 Jan 1 00:14 root
    drwxr-xr-x 25 root root 4096 Jan 1 00:23 run
    lrwxrwxrwx 16 root root 8 Apr 5 2022 sbin -> usr/sbin
    drwxr-xr-x 2 root root 4096 Apr 5 2022 srv
    dr-xr-xr-x 2 root root 4096 Jan 1 1970 sys
    drwxrwxrwt 2 root root 4096 Jan 1 00:14 tmp
    drwxr-xr-x 11 root root 4096 Apr 5 2022 usr
    drwxr-xr-x 11 root root 4096 Apr 5 2022 var
    root@zombie:/var/local/backups/pi4b-0.internal.systemsaviour.com/root#

    This makes it very easy to recover a single file, or all files on an entire filesystem (where I take the stance that anything below the filesystem level can always be recreated easily enough). I actually had to use it to recover from a failed SD card in a Raspberry Pi the other day (which I use to host my website). It also provides for the possibility to check for filesystem corruption by identifying which files are not hard links between any two snapshots (and check for any unexpected differences).
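
    For anyone curious how that layout is typically achieved, the core trick is rsync's --link-dest option, which hard-links files that are unchanged since the previous snapshot. A simplified sketch (not the actual script linked above; rotation depth and excludes are trimmed for brevity):

    #!/bin/sh
    # assumed example: pull a hard-linked daily snapshot of a remote host over SSH
    DEST=/var/local/backups/pi4b-0.internal.systemsaviour.com/root
    rm -rf "$DEST"/daily.5                      # drop the oldest snapshot
    for i in 4 3 2 1 0; do                      # shift the remaining ones along
        [ -d "$DEST"/daily.$i ] && mv "$DEST"/daily.$i "$DEST"/daily.$((i+1))
    done
    rsync -a --delete \
        --link-dest="$DEST"/daily.1 \
        root@pi4b-0.internal.systemsaviour.com:/ "$DEST"/daily.0/
    # unchanged files in daily.0 are hard links into daily.1, so every snapshot
    # looks complete while only changed files consume extra space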

    --
    It's GNU/Linux dammit!