Data-Destroying Defect Found In OpenZFS 2.2.0

Accepted submission by Arthur T Knackerbracket at 2023-11-27 15:42:22
Software

Arthur T Knackerbracket has processed the following story [theregister.com]:

A data-destroying bug has been discovered in a new feature of OpenZFS 2.2.0, as found in FreeBSD 14 among other OSes.

OpenZFS 2.2.0 was released just last month [github.com] with a new feature called block cloning [github.com], as we reported when we looked at release candidate 3 [theregister.com]. Unfortunately, there's a bug in that code somewhere, as found by Gentoo-toting Terin Stock [terinstock.com], who reported bug #15526 [github.com]. As a result, OpenZFS 2.2.1 [github.com] is already out, which disables the new feature.
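For readers who want to confirm which release they are running, the OpenZFS userland tools report both the utility version and the loaded kernel-module version. A minimal sketch in Python, assuming the standard zfs version subcommand is on the PATH:

```python
import subprocess

def openzfs_version() -> str:
    """Return the output of `zfs version`, which lists both the
    userland tools and the loaded kernel module (e.g. zfs-2.2.1)."""
    result = subprocess.run(
        ["zfs", "version"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print(openzfs_version())
```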

This is a bit of an embarrassment for OpenZFS, a project with an enviable reputation for data integrity. It's also less than ideal for fixed-release-cycle OSes that have the new version of OpenZFS, including the newly released FreeBSD 14 [theregister.com]. Fortunately for FreeBSD, though, version 14.0 ships with the feature disabled by default [wordpress.com].

We have mentioned [theregister.com] the work [theregister.com] of BSD boffin Colin Percival before, but anyone brave enough to have already installed this point-zero release should heed his warning [twitter.com] on Twitter, now X: "FreeBSD 14's ZFS code supports 'block cloning'. This is turned off by default. DO NOT ENABLE THIS FEATURE UNLESS YOU WANT TO LOSE DATA."
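Whether block cloning is active is governed by a tunable: the bug discussion refers to a vfs.zfs.bclone_enabled sysctl on FreeBSD and a zfs_bclone_enabled module parameter on Linux. Treat those names as assumptions and check them against your own platform's documentation; the hedged sketch below simply reads whichever one is present:

```python
import pathlib
import subprocess

# Tunable names as mentioned in the OpenZFS bug discussion; treat them as
# assumptions and verify against your own platform's documentation.
LINUX_PARAM = pathlib.Path("/sys/module/zfs/parameters/zfs_bclone_enabled")
FREEBSD_SYSCTL = "vfs.zfs.bclone_enabled"

def block_cloning_enabled():
    """Return True/False if the tunable can be read, or None if it can't."""
    if LINUX_PARAM.exists():                       # Linux: module parameter
        return LINUX_PARAM.read_text().strip() != "0"
    try:                                           # FreeBSD: sysctl(8)
        out = subprocess.run(
            ["sysctl", "-n", FREEBSD_SYSCTL],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        return out != "0"
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None

if __name__ == "__main__":
    state = block_cloning_enabled()
    print({True: "block cloning is ENABLED",
           False: "block cloning is disabled",
           None: "could not read the tunable on this system"}[state])
```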

The bug manifests as corruption of the contents of files when they're copied; instead of their expected contents, there are stretches of zeroes, mixed with blocks of what looks like Base64-encoded data. It showed up when using Gentoo's portage command, the distro's package-management tool – an operation that typically involves copying lots of data. Worse still is that the file system's own health checks don't detect any problem. For now, release 2.2.1 simply disables the feature.

At the time of writing, it's not certain exactly what causes it. It seems to be an extremely specific (and therefore unlikely) combination of circumstances, which means it almost never happens, as Bronek Kozicki spells out [github.com] on GitHub:

You need to understand the mechanism that causes corruption. It might have been there for decade and only caused issues in a very specific scenarios, which do not normally happen. Unless you can match your backup mechanism to the conditions described below, you are very unlikely to have been affected by it.

One of the bug hunters has written a tiny script, reproducer.sh [github.com], which hammers ZFS volumes and checks to see if files are getting corrupted. One of the problems around this issue is that there's no way to write a program that can report if a file has been corrupted or not by inspecting its contents: it's perfectly normal for some types of file to contain long stretches of zeroes. The only way to be sure is comparing checksums from before and after copy operations – so concerned users who lack backups held on other types of file system cannot readily tell. OpenZFS's built-in scrub tool for checking the validity of storage pools cannot detect the problem.
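Because corrupted files cannot be identified from their contents alone, the only dependable test is to checksum a file before it is copied and compare the copy afterwards. The sketch below is not the project's reproducer.sh, merely an illustration of the before-and-after comparison that it, and any concerned user, has to rely on:

```python
import hashlib
import shutil

def sha256(path: str) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def copy_and_verify(src: str, dst: str) -> bool:
    """Copy src to dst and confirm the copy's checksum matches the original."""
    before = sha256(src)
    shutil.copyfile(src, dst)
    after = sha256(dst)
    return before == after

if __name__ == "__main__":
    ok = copy_and_verify("original.dat", "copy.dat")
    print("checksums match" if ok else "MISMATCH: copy is corrupted")
```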

A possible fix [github.com] is open, and the investigation looks like it may have uncovered an underlying, different, and pre-existing bug, which could have been present as long ago as 2013 [github.com]. The bug revolves around ZFS dnodes [utoronto.ca] and the logic by which the code checks whether a dnode is "dirty" or not, which governs whether it must be flushed [openzfs.org]: that is, whether any changes must be synced to disk.
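That description, a "dirty" check deciding whether in-memory changes must be flushed before the on-disk state can be trusted, is a classic time-of-check-to-time-of-use hazard. The toy sketch below is purely conceptual: it is not ZFS code and every name in it is invented, but it shows how trusting a wrong "not dirty" answer lets a reader copy stale, zero-filled blocks instead of the data still waiting to be synced:

```python
class ToyNode:
    """Invented stand-in for a dnode; this is NOT ZFS code."""
    def __init__(self):
        self.on_disk = b"\x00" * 8      # state last synced to disk
        self.pending = None             # unflushed in-memory write
        self.dirty_flag = False         # what the flawed check consults

    def write(self, data):
        self.pending = data
        # Conceptual bug: the flag is updated later/elsewhere, so a reader
        # that checks it right now is told the node is clean.

    def read_for_copy(self):
        if not self.dirty_flag:         # flawed "is it dirty?" check
            return self.on_disk         # stale view: a run of zeroes
        self.on_disk = self.pending     # correct path: flush first...
        self.pending = None
        self.dirty_flag = False
        return self.on_disk             # ...then read what was written

node = ToyNode()
node.write(b"realdata")
print(node.read_for_copy())             # b'\x00...' instead of b'realdata'
```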

It's possible that this single cause was deeply hidden, and so very unlikely to be hit. Unfortunately, the new, faster copy functionality meant that a bug which used to corrupt data only once in tens of millions of file copies suddenly became far more likely to strike, especially on machines with lots of processor cores all in simultaneous use.

For Linux users, an additional condition seems to be that the OS has a recent version of the coreutils [gnu.org] package – version 9.x or newer. This is the package that provides the cp command. So far, we have not been able to verify whether Ubuntu 23.10 has the block-cloning feature enabled by default in its recently returned [theregister.com] (but still experimental) support for installation onto ZFS, but at least one comment on the original bug report is from someone who has reproduced the problem on Ubuntu.
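Linux users unsure which coreutils they have can check the first line of cp --version, which names the package and its version number. A small sketch, assuming GNU coreutils' usual output format:

```python
import re
import subprocess

def coreutils_version():
    """Parse the version number from the first line of `cp --version`."""
    out = subprocess.run(["cp", "--version"],
                         capture_output=True, text=True, check=True).stdout
    match = re.search(r"(\d+(?:\.\d+)+)", out.splitlines()[0])
    return match.group(1) if match else None

if __name__ == "__main__":
    version = coreutils_version()
    if version and int(version.split(".")[0]) >= 9:
        print(f"coreutils {version}: new enough to use the faster copy path")
    else:
        print(f"coreutils {version}: predates the 9.x copy changes")
```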

It seems very likely that OpenZFS 2.2.1, which simply turns off block-cloning, will quickly be followed by a 2.2.2 release to fix the underlying dnode handling. ®

