SoylentNews is people

posted by janrinok on Tuesday February 06, @03:51AM
from the confidentiality-integrity-and-availability dept.

Exotic Silicon has a detailed exploration of how and why to make long-term backups.

The myth...

When thinking about data backup, many people have tended to fixate on the possibility of a crashed hard disk, and in modern times, a totally dead SSD. It's been the classic disaster scenario for decades, assuming that your office doesn't burn down overnight. You sit down in front of your desktop in the morning, and it won't boot. As you reach in to fiddle with SATA cables and clean connections, you realise that the disk isn't even spinning up.

Maybe you knew enough to try a couple of short, sharp, ninety-degree twists in the plane of the platters, in case it was caused by stiction. But sooner or later, reality dawns, and it becomes clear that the disk will never spin again. It, along with your data, is gone forever. So a couple of full backups at regular intervals should suffice, right?

Except that isn't how it usually happens - most likely you'll be calling on your backups for some other reason.

The reality...

Aside from the fact that when modern SSDs fail they often remain readable, i.e. they become read-only, your data is much more likely to be at risk from silent corruption over time, or from being overwritten due to operator error.

Silent corruption can happen for reasons ranging from bad SATA cables and buggy SSD firmware to malware and more. Operator error might go genuinely unnoticed, or be covered up.

Both of these scenarios can be protected against with an adequate backup strategy, but the simple approach of a regular full backup (which also often goes untested) in many cases just won't suffice.

Aspects like the time interval between backups, how many copies to have and how long to keep them, speed of recovery, and the confidentiality and integrity of said backups are all addressed. Also covered are silent corruption, archiving unchanging data, examples of comprehensive backup plans, and how to correctly store, label, and handle the backup storage media.

Not all storage media have long life spans.


Original Submission

 
This discussion was created by janrinok (52) for logged-in users only, but now has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 5, Insightful) by sigterm on Tuesday February 06, @11:41AM (2 children)

    by sigterm (849) on Tuesday February 06, @11:41AM (#1343319)

    Regarding your comment of, "the author failing to understand statistics", and the necessity of two or more faults occurring at the same time for a RAID to cause data loss, this is inaccurate.

    A more precise statement would be "a second or third fault would have to occur within the window of time from the first fault occurring and the array being restored to a redundant state."

    A failure of almost any single part of a typical RAID system can result in unpredictable behaviour.

    No, it can't. A drive failure will lead to the drive being ejected from the array. That's entirely predictable, and also the most common failure mode by far.

    Any other kind of hardware failure (controller, cables, memory etc) will result in the exact same symptoms as a similar hardware failure in a non-RAID setup.

    Compare these two configurations:

    1. A single drive.

    2. A two-drive RAID mirror, using different disks from different manufacturers.

    The possibility of either one of the drives in the RAID mirror having buggy firmware that can cause silent data corruption is statistically higher than the single drive having buggy firmware.

    You left out the part where a buggy firmware causing silent data corruption is a scenario that:

    a) is extremely rare,

    b) is pretty much guaranteed to never occur simultaneously on two drives from different manufacturers within the lifetime of the Universe,

    c) will be detected by RAID data scrubbing, and may even be fixed if the corruption leads to ECC errors on the drive itself, and

    d) if truly silent, as in the drive returns corrupt data with no error message (and now we're pretty much into fantasy land), will affect backups every bit as much as the disk subsystem, so now you have a reliable backup of corrupted data, which is not great.

    Unless the RAID is configured to read from both drives for every read request and verify that the data matches (which is uncommon), reading from any one drive with buggy firmware can cause corrupt data to be returned silently.

    As mentioned, this is such an atypical scenario that I haven't actually seen a single case of this ever happening. Also, why drag RAID into this, as the exact same problem would occur regardless of disk subsystem configuration?

    Critically, if the system does a write back to disk at all, this corrupt data will be written back out to BOTH drives as if it were good data, unless there is some checking of data integrity done at the application level.

    You mean, in the same way that the corrupt data would be written to every offline backup without anyone noticing?

    You didn't address the obvious issue, which is that if the chance of one drive failing is X, the chance of two drives failing simultaneously, or even within a specified interval, is much smaller than X.
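    To put rough numbers on the two statistical claims in this exchange, here is a minimal Python sketch. It assumes that faults on different drives are independent, and the 2% per-window fault probability is a made-up figure for illustration, not a measured rate. The function names are hypothetical as well.

```python
# Illustrative arithmetic only. Assumes drive faults are independent and
# uses an invented per-window fault probability, not real drive statistics.


def p_any_of_n(p: float, n: int) -> float:
    """Chance that at least one of n independent drives has the fault."""
    return 1.0 - (1.0 - p) ** n


def p_all_of_n(p: float, n: int) -> float:
    """Chance that all n independent drives have the fault in the same window."""
    return p ** n


if __name__ == "__main__":
    p = 0.02  # assumed chance of one drive faulting within the window
    print(f"single drive faults:         {p:.4f}")
    print(f"either of two drives faults: {p_any_of_n(p, 2):.4f}")
    print(f"both of two drives fault:    {p_all_of_n(p, 2):.6f}")
```

    Under those assumptions, a fault on either of two drives is roughly twice as likely as a fault on a single drive, while a fault on both drives within the same window is far less likely than a fault on one.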

    RAID is no replacement for backups, and backups are no replacement for RAID. (Unless downtime isn't an issue and your time is worth nothing, I guess.)

  • (Score: 4, Interesting) by Crystal on Tuesday February 06, @07:13PM (1 child)

    by Crystal (28042) on Tuesday February 06, @07:13PM (#1343355)
    A more precise statement would be "a second or third fault would have to occur within the window of time from the first fault occurring and the array being restored to a redundant state."

    No, a single fault can easily take out a RAID. This was even more common when multi-drop SCSI cabling was the norm for RAID. A single failing drive would often down the entire SCSI bus. PSU failures can also cause all sorts of unpredictable behaviour.

    A drive failure will lead to the drive being ejected from the array.

    Nice theory. The reality is very different.

    You left out the part where a buggy firmware causing silent data corruption is a scenario that: a) is extremely rare,

    The "buggy firmware" example was just the first one of many that I thought of. It's a convenient simple example of a failure scenario, but by no means the only or most likely one.

    b) is pretty much guaranteed to never occur simultaneously on two drives from different manufacturers within the lifetime of the Universe,

    It doesn't need to, that was the entire point of the comment.

    If any one of the drives in the mirror sends corrupt data unchecked over the bus to the host adaptor, you have a problem.

    Data spread over more drives = statistically more chance of this happening (due to certain specific failure modes).

    c) will be detected by RAID data scrubbing

    No, it will NOT. At least, not reliably (unless it happens during the scrub, and never at other times).

    and may even be fixed if the corruption leads to ECC errors on the drive itself

    In the buggy firmware scenario, it's entirely plausible that there will be no ECC error reported, because the data read from the media was correct and passed ECC, but was then corrupted by the firmware bug before being sent over the wire.

    Or the entirely wrong block was read. Good ECC, wrong data.

    d) if truly silent, as in the drive returns corrupt data with no error message (and now we're pretty much into fantasy land)

    We literally have several SSDs here which have done exactly that in the last couple of years, so it's hardly 'fantasy land'.

    Our comprehensive hashing and checksumming policies caught that silent data corruption, which would probably have gone unnoticed elsewhere.
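    For readers wondering what the simplest form of such a check might look like, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of the tooling referred to above: the function names, the use of SHA-256, and the in-memory manifest are all hypothetical choices.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_mismatches(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return manifest entries that are now missing or hash differently."""
    bad = []
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file() or sha256_of(p) != expected:
            bad.append(rel)
    return bad
```

    A mismatch only says that a file no longer matches the digest recorded earlier. Telling corruption apart from a legitimate edit still requires knowing which files were supposed to stay unchanged.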

    will affect backups every bit as much as the disk subsystem, so now you have a reliable backup of corrupted data, which is not great.

    We seem to have drifted far away from the original issue at this point. This seems like a whole different discussion.

    • (Score: 3, Interesting) by sigterm on Tuesday February 06, @11:48PM

      by sigterm (849) on Tuesday February 06, @11:48PM (#1343419)

      No, a single fault can easily take out a RAID. This was even more common when multi-drop SCSI cabling was the norm for RAID. A single failing drive would often down the entire SCSI bus.

      Was it, really? I actually never experienced that in my 30 years working with storage.

      PSU failures can also cause all sorts of unpredictable behaviour.

      Switch mode PSUs typically have one failure mode: Broken.

      A drive failure will lead to the drive being ejected from the array.

      Nice theory. The reality is very different.

      As mentioned, you're now talking to someone with somewhat extensive experience with storage, from pre-LVD SCSI to Fibre Channel and AoE/iSCSI. I haven't really experienced this "very different" reality, and it seems neither have my colleagues.

      If any one of the drives in the mirror sends corrupt data unchecked over the bus to the host adaptor, you have a problem.

      Can you mention even one scenario where this can happen, that doesn't involve the previously mentioned extremely-rare-to-the-point-of-being-purely-theoretical scenario of a peculiar firmware bug?

      Data spread over more drives = statistically more chance of this happening (due to certain specific failure modes).

      But twice "basically never" is still "basically never."

      c) will be detected by RAID data scrubbing

      No, it will NOT. At least, not reliably (unless it happens during the scrub, and never at other times).

      Sorry, but now you're talking plain nonsense. There is no way corrupted data being sent from a drive won't be detected by a scrubbing operation. Sure, in a RAID 1 or RAID 5 setup the controller won't be able to determine which drive is generating bad data, and as such won't be able to automatically correct the error, but it will certainly be detected.

      And since RAID 6 is pretty much the norm for large storage systems today (due in part to the time it takes to rebuild an array), the error will actually be corrected as well.
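      As a toy illustration of the RAID 1 case described above, the Python sketch below compares two mirror members block by block. It is a simplified model under stated assumptions, not how any real controller or md implementation works: drives are modelled as in-memory lists of byte strings, and the name scrub_mirror is hypothetical.

```python
def scrub_mirror(drive_a: list[bytes], drive_b: list[bytes]) -> list[int]:
    """Return the indices of blocks where the two mirror copies disagree.

    A two-way mirror has no third opinion, so this can report that a block
    is inconsistent but cannot say which copy holds the good data.
    """
    if len(drive_a) != len(drive_b):
        raise ValueError("mirror members must hold the same number of blocks")
    return [i for i, (a, b) in enumerate(zip(drive_a, drive_b)) if a != b]


if __name__ == "__main__":
    good = [b"aaaa", b"bbbb", b"cccc"]
    flaky = list(good)
    flaky[1] = b"bXbb"  # simulated corruption on one member
    print("mismatched blocks:", scrub_mirror(good, flaky))
```

      The sketch only flags differences that are visible when both copies are read during the scrub, and it says nothing about which member is wrong.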

      In the buggy firmware scenario, it's entirely plausible that there will be no ECC error reported, because the data read from the media was correct and passed ECC, but was then corrupted by the firmware bug before being sent over the wire.

      And in that case, all your backups will be silently corrupted, and the error will only ever be caught if you scrub a RAID, or notice that your applications crash or return obviously erroneous data.

      d) if truly silent, as in the drive returns corrupt data with no error message (and now we're pretty much into fantasy land)

      We literally have several SSDs here which have done exactly that in the last couple of years, so it's hardly 'fantasy land'.

      Then you need to name and shame the manufacturer(s) immediately. Short of obvious scam products from China, I've never seen this "in the wild." Haven't seen it reported by Backblaze either, which I would expect given the sheer number of disks they go through every year.

      will affect backups every bit as much as the disk subsystem, so now you have a reliable backup of corrupted data, which is not great.

      We seem to have drifted far away from the original issue at this point. This seems like a whole different discussion.

      This discussion started with an article praising the virtues of offline backups (which I fully agree with), while also making unsubstantiated claims about RAID. I'm pointing out that RAID will successfully detect and even deal with the exact issues raised; issues that, if real, would render backups useless.