SoylentNews is people

posted by janrinok on Tuesday February 06, @03:51AM
from the confidentiality-integrity-and-availability dept.

Exotic Silicon has a detailed exploration of how and why to make long-term backups.

The myth...

When thinking about data backup, many people have tended to fixate on the possibility of a crashed hard disk, and in modern times, a totally dead SSD. It's been the classic disaster scenario for decades, assuming that your office doesn't burn down overnight. You sit down in front of your desktop in the morning, and it won't boot. As you reach in to fiddle with SATA cables and clean connections, you realise that the disk isn't even spinning up.

Maybe you knew enough to try a couple of short, sharp, ninety degree twists in the plane of the platters, in case it was caused by stiction. But sooner or later, reality dawns, and it becomes clear that the disk will never spin again. It, along with your data, is gone forever. So a couple of full back-ups at regular intervals should suffice, right?

Except that isn't how it usually happens - most likely you'll be calling on your backups for some other reason.

The reality...

Aside from the fact that modern SSDs often remain readable when they fail, i.e. they become read-only, your data is much more likely to be at risk from silent corruption over time, or from being overwritten due to operator error.

Silent corruption can happen for reasons ranging from bad SATA cables and buggy SSD firmware to malware and more. Operator error might go genuinely unnoticed, or be covered up.

Both of these scenarios can be protected against with an adequate backup strategy, but the simple approach of a regular full backup (which also often goes untested) in many cases just won't suffice.

Aspects like the time interval between backups, how many copies to have and how long to keep them, speed of recovery, and the confidentiality and integrity of said backups are all addressed. Also covered are silent corruption, archiving unchanging data, examples of comprehensive backup plans, and how to correctly store, label, and handle the backup storage media.

Not all storage media have long life spans.


This discussion was created by janrinok (52) for logged-in users only, but now has been archived. No new comments can be posted.
  • (Score: 4, Interesting) by Crystal on Tuesday February 06, @07:13PM (1 child)

    by Crystal (28042) on Tuesday February 06, @07:13PM (#1343355)
    A more precise statement would be "a second or third fault would have to occur within the window of time between the first fault occurring and the array being restored to a redundant state."

    No, a single fault can easily take out a RAID. This was even more common when multi-drop SCSI cabling was the norm for RAID. A single failing drive would often down the entire SCSI bus. PSU failures can also cause all sorts of unpredictable behaviour.

    A drive failure will lead to the drive being ejected from the array.

    Nice theory. The reality is very different.

    You left out the part where a buggy firmware causing silent data corruption is a scenario that: a) is extremely rare,

    The "buggy firmware" example was just the first one of many that I thought of. It's a convenient simple example of a failure scenario, but by no means the only or most likely one.

    b) is pretty much guaranteed to never occur simultaneously on two drives from different manufacturers within the lifetime of the Universe,

    It doesn't need to, that was the entire point of the comment.

    If any one of the drives in the mirror sends corrupt data unchecked over the bus to the host adaptor, you have a problem.

    Data spread over more drives = statistically more chance of this happening, (due to certain specific failure modes)
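The "more drives, more chance" argument can be put into numbers with a minimal sketch. It assumes each drive independently returns corrupt data with the same probability over some period; the function name and the per-drive figure used in the loop are hypothetical and not taken from the thread.

```python
def p_any_drive_corrupts(p_single: float, n_drives: int) -> float:
    """Probability that at least one of n_drives returns corrupt data.

    Assumes the drives fail independently and share the same
    per-drive probability p_single (both are simplifying assumptions).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability between 0 and 1")
    if n_drives < 0:
        raise ValueError("n_drives must not be negative")
    # Complement of "no drive corrupts anything".
    return 1.0 - (1.0 - p_single) ** n_drives


if __name__ == "__main__":
    # Hypothetical per-drive probability, chosen only for illustration.
    for n in (1, 2, 4, 8):
        print(n, p_any_drive_corrupts(1e-4, n))
```

For small per-drive probabilities the result grows almost linearly with the number of drives, so whether that matters in practice depends entirely on how small the per-drive figure really is.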

    c) will be detected by RAID data scrubbing

    No it will NOT. At least, not reliably, (unless it happens during the scrub, and never at other times.)

    and may even be fixed if the corruption leads to ECC errors on the drive itself

    In the buggy firmware scenario, it's entirely plausible that there will be no ECC error reported, because the data read from the media was correct and passed ECC, but was then corrupted by the firmware bug before being sent over the wire.

    Or the entirely wrong block was read. Good ECC, wrong data.
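The "good ECC, wrong data" case can be illustrated with a toy model. This is only a sketch under stated assumptions: `zlib.crc32` stands in for the drive's internal ECC, `buggy_read` is a hypothetical firmware bug that returns the neighbouring block, and the host-side tag that covers the block address is an illustrative scheme, not a description of any particular product.

```python
from __future__ import annotations

import zlib


def make_sector(payload: bytes) -> tuple[bytes, int]:
    """Store a payload with a drive-internal CRC (stand-in for on-media ECC)."""
    return payload, zlib.crc32(payload)


def sector_ok(sector: tuple[bytes, int]) -> bool:
    """Drive-side check: does the payload match the CRC stored with it?"""
    payload, crc = sector
    return zlib.crc32(payload) == crc


def host_tag(lba: int, payload: bytes) -> int:
    """Host-side checksum covering the block address as well as the payload."""
    return zlib.crc32(lba.to_bytes(8, "little") + payload)


# A tiny four-block "disk" and the tags the host recorded at write time.
disk = {lba: make_sector(f"block {lba}".encode()) for lba in range(4)}
host_tags = {lba: host_tag(lba, disk[lba][0]) for lba in disk}


def buggy_read(lba: int) -> tuple[bytes, int]:
    """Hypothetical firmware bug: returns the next block instead of lba."""
    return disk[(lba + 1) % len(disk)]


sector = buggy_read(2)
# The sector is internally consistent, so the drive-side check passes.
drive_check_passes = sector_ok(sector)
# The host asked for block 2 but received block 3, so its own tag mismatches.
host_check_passes = host_tag(2, sector[0]) == host_tags[2]
```

In this model the drive-side check succeeds while the host-side check fails, which is the situation the comment describes: nothing below the host notices that anything went wrong.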

    d) if truly silent, as in the drive returns corrupt data with no error message (and now we're pretty much into fantasy land)

    We literally have several SSDs here which have done exactly that in the last couple of years, so it's hardly 'fantasy land'.

    Our comprehensive hashing and checksumming policies caught that silent data corruption, which would probably have gone unnoticed elsewhere.
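The comment does not say how those hashing and checksumming policies work. One common approach, shown here purely as a sketch with hypothetical function names, is to record a SHA-256 digest for every file and re-verify the files against that manifest later.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or no longer match."""
    bad = []
    for rel, expected in manifest.items():
        path = root / rel
        if not path.is_file() or sha256_of(path) != expected:
            bad.append(rel)
    return bad
```

A manifest like this only helps if it is stored separately from the data it describes, and it cannot tell deliberate edits apart from corruption, so it fits best with data that is not supposed to change.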

    will affect backups every bit as much as the disk subsystem, so now you have a reliable backup of corrupted data, which is not great.

    We seem to have drifted far away from the original issue at this point. This seems like a whole different discussion.

  • (Score: 3, Interesting) by sigterm on Tuesday February 06, @11:48PM

    by sigterm (849) on Tuesday February 06, @11:48PM (#1343419)

    No, a single fault can easily take out a RAID. This was even more common when multi-drop SCSI cabling was the norm for RAID. A single failing drive would often down the entire SCSI bus.

    Was it, really? I actually never experienced that in my 30 years working with storage.

    PSU failures can also cause all sorts of unpredictable behaviour.

    Switch mode PSUs typically have one failure mode: Broken.

    A drive failure will lead to the drive being ejected from the array.

    Nice theory. The reality is very different.

    As mentioned, you're now talking to someone with somewhat extensive experience with storage, from pre-LVD SCSI to Fibre Channel and AoE/iSCSI. I haven't really experienced this "very different" reality, and it seems neither have my colleagues.

    If any one of the drives in the mirror sends corrupt data unchecked over the bus to the host adaptor, you have a problem.

    Can you mention even one scenario where this can happen, that doesn't involve the previously mentioned extremely-rare-to-the-point-of-being-purely-theoretical scenario of a peculiar firmware bug?

    Data spread over more drives = statistically more chance of this happening, (due to certain specific failure modes)

    But twice "basically never" is still "basically never."

    c) will be detected by RAID data scrubbing

    No it will NOT. At least, not reliably, (unless it happens during the scrub, and never at other times.)

    Sorry, but now you're talking plain nonsense. There is no way corrupted data being sent from a drive won't be detected by a scrubbing operation. Sure, in a RAID 1 or RAID 5 setup the controller won't be able to determine which drive is generating bad data, and as such won't be able to automatically correct the error, but it will certainly be detected.

    And since RAID 6 is pretty much the norm for large storage systems today (due in part to the time it takes to rebuild an array), the error will actually be corrected as well.
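The claim that a single-parity scrub detects a mismatch without being able to say which drive is at fault can be shown with a toy XOR stripe. This is a sketch of the principle only: it assumes the corrupt data is actually what gets read during the scrub, and it does not model any real controller. All names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(stripe):
    """A single-parity stripe is consistent when all its blocks XOR to zero."""
    return not any(xor_blocks(stripe))


def rebuild(stripe, index):
    """Recompute the block at index from every other block in the stripe."""
    return xor_blocks([blk for i, blk in enumerate(stripe) if i != index])


data = [b"AAAA", b"BBBB", b"CCCC"]
good_stripe = data + [xor_blocks(data)]  # three data blocks plus parity

# One drive silently returns a changed block during the scrub.
bad_stripe = list(good_stripe)
bad_stripe[1] = b"BxBB"
scrub_detects_mismatch = not stripe_is_consistent(bad_stripe)

# Rebuilding ANY single block from the others makes the stripe consistent
# again, so the parity alone cannot identify which block was wrong.
plausible_culprits = [
    i
    for i in range(len(bad_stripe))
    if stripe_is_consistent(
        bad_stripe[:i] + [rebuild(bad_stripe, i)] + bad_stripe[i + 1:]
    )
]
```

Here the scrub flags the stripe, and every block, parity included, remains a plausible culprit. Locating the bad block needs extra information such as a second, independent parity or a per-block checksum.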

    In the buggy firmware scenario, it's entirely plausible that there will be no ECC error reported, because the data read from the media was correct and passed ECC, but was then corrupted by the firmware bug before being sent over the wire.

    And in that case, all your backups will be silently corrupted, and the error will only ever be caught if you scrub a RAID, or notice that your applications crash or return obviously erroneous data.

    d) if truly silent, as in the drive returns corrupt data with no error message (and now we're pretty much into fantasy land)

    We literally have several SSDs here which have done exactly that in the last couple of years, so it's hardly 'fantasy land'.

    Then you need to name and shame the manufacturer(s) immediately. Short of obvious scam products from China, I've never seen this "in the wild." Haven't seen it reported by Backblaze either, which I would expect given the sheer number of disks they go through every year.

    will affect backups every bit as much as the disk subsystem, so now you have a reliable backup of corrupted data, which is not great.

    We seem to have drifted far away from the original issue at this point. This seems like a whole different discussion.

    This discussion started with an article praising the virtues of offline backups (which I fully agree with), while also making unsubstantiated claims about RAID. I'm pointing out that RAID will successfully detect and even deal with the exact issues raised; issues that, if real, would render backups useless.