
posted by cmn32480 on Sunday August 07 2016, @02:38PM   Printer-friendly
from the how-bad-could-it-really-be dept.

The nice feller over at Phoronix brings us this handy bit of info:

It turns out the RAID5 and RAID6 code for the Btrfs file-system's built-in RAID support is faulty, and users should not be making use of it if they care about their data.

A mailing list thread has been running since the end of July about Btrfs scrub recalculating the wrong parity in RAID5. The wrong parity and unrecoverable errors have been confirmed by multiple parties. The Btrfs RAID 5/6 code has been described as fatally flawed -- "more or less fatally flawed, and a full scrap and rewrite to an entirely different raid56 mode on-disk format may be necessary to fix it. And what's even clearer is that people /really/ shouldn't be using raid56 mode for anything but testing with throw-away data, at this point. Anything else is simply irresponsible."

Just as well I haven't gotten around to trying it then.
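
For readers who aren't sure whether an existing Btrfs filesystem is using the affected raid56 profiles, a check and (if needed) a conversion away from them look roughly like this -- the mount point is a placeholder, and the convert filters shown are the generally suggested route rather than anything specified in the article:

% sudo btrfs filesystem df /mnt                                  # look for "Data, RAID5" / "Metadata, RAID6" lines
% sudo btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt  # rewrite data/metadata into a non-raid56 profile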


Original Submission

 
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 2, Interesting) by Anonymous Coward on Sunday August 07 2016, @03:30PM

    by Anonymous Coward on Sunday August 07 2016, @03:30PM (#384984)

    This should be the title:

    Btrfs Code Completely Hosed

    Why? Because it is not just the RAID code that is hosed, it is ALL of BTRFS that is hosed.

    Personal experience: large disk array, running RAID1 (mirroring). The filesystem would run for months with no problem, then suddenly hose itself, with no recovery beyond re-init and reload. It did this twice; after #2, it was replaced by ext4. The ext4 filesystem has been running for several years now and has never hosed itself.

    Prior to the second hosing of the large disk array, a workstation system was set up with BTRFS in plain mode (no RAID, nothing special, just a filesystem). Came home from work one day, and it too had hosed itself, with no recovery beyond a restore from backup. Well, the restore from backup was onto ext4 as well, and that FS ran until the disk holding it gave up (hardware failure).

    So, after three hosings, this Linux user will never use BTRFS for anything ever again. And I recommend you do not use it for anything either. It is, at best, still pre-alpha quality code.

  • (Score: 1, Informative) by Anonymous Coward on Sunday August 07 2016, @06:01PM

    by Anonymous Coward on Sunday August 07 2016, @06:01PM (#385009)

    Different AC here, but I've had similar experiences with btrfs killing itself. It was on a fresh install of openSUSE, I believe, and it didn't last a week before it wouldn't boot. Nothing fancy, no RAID or the like, just pure defaults, which is why I used it at all. It didn't inspire confidence in the file system or in the distro that picked it.

    • (Score: 2) by rleigh on Sunday August 07 2016, @06:48PM

      by rleigh (4887) on Sunday August 07 2016, @06:48PM (#385017) Homepage

      You most likely hit the unbalancing issue which makes it go read-only. I've seen this repeatedly. A rebalance would fix it, though it takes ages and kills the system's performance while it grinds away. Note I use "issue" rather than "bug" because it's more of a design flaw, though the implementation is undoubtedly also defective. Search for "btrfs read only balance" for a whole lot of detail about it. When I first read that SuSE was going to default to Btrfs, my initial reaction was utter disbelief; how on earth they justified doing that I hate to think.
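
      If anyone hits this, the usual workaround is a filtered rebalance that rewrites only partly-full chunks. A rough sketch -- the mount point and the 50% usage thresholds are placeholders:

      % sudo btrfs filesystem usage /mnt                     # how much is allocated vs. actually used
      % sudo btrfs balance start -dusage=50 -musage=50 /mnt  # rebalance data/metadata chunks that are at most 50% full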

      • (Score: 0) by Anonymous Coward on Sunday August 07 2016, @08:54PM

        by Anonymous Coward on Sunday August 07 2016, @08:54PM (#385042)

        It's also the default on SUSE's commercial distro, which is supported until 2030 or something, so .... But you can also select ext4 or XFS. Some influential people were pushing for it, so SUSE is stuck with it now - hopefully it's not too buggy!

        • (Score: 0) by Anonymous Coward on Monday August 08 2016, @10:45PM

          by Anonymous Coward on Monday August 08 2016, @10:45PM (#385519)

          Commercial distro, eh? As in commercial support? A great way to get people to call you for support would be to have the file system hose itself seemingly at random, especially at the times they are most desperate.

      • (Score: 0) by Anonymous Coward on Sunday August 07 2016, @09:12PM

        by Anonymous Coward on Sunday August 07 2016, @09:12PM (#385044)

        Nope, in my SuSE case it was a single-disk VM, and mounting it into another VM with btrfs support reported errors when running through the diagnostic procedure on the wiki. But if you have stupid things like rebalancing pop up within a week of light use on a single disk -- assuming that was even the problem rather than actual corruption -- then btrfs is total garbage, and SuSE is too for having garbage defaults.

  • (Score: 2) by rleigh on Sunday August 07 2016, @06:23PM

    by rleigh (4887) on Sunday August 07 2016, @06:23PM (#385013) Homepage

    I've had bad experiences with it since the start. I wrote up some of the issues I had in this related thread: https://news.ycombinator.com/item?id=12232907#12233154 [ycombinator.com] These horrific problems are by no means isolated instances. Even if the RAID1 code is now fixed, I've lost all trust in it. There's just too much stuff which is fundamentally broken, and that's just not acceptable in a filesystem. I'm simply not prepared to lose any more data, or suffer any more downtime, because of it. I had high hopes for it, but it's turned into a seriously bad joke. Too many times people have told me, "oh, you need to upgrade to the latest kernel for $fix". How many times do you consider it acceptable for me to lose my data? Sorry, but it's not ready for production use, and it never has been.

    Over the last 2.5 years, I've been using ZFS on FreeBSD. What an absolute revelation and joy to use after 15 years of mdraid and LVM (and Btrfs). I wish I'd discovered it years before; I've got systemd to thank for that, and I'm genuinely happy that it gave me the push to test the waters outside the (increasingly insular) Linux sphere.

    But ZFS is getting much better supported on Linux as well. With Ubuntu 16.04, it's possible to boot directly to a root filesystem on ZFS, with /boot on ZFS. It's still a little rough--not supported directly by the installer--but all the pieces are there in GRUB, the initramfs, the init scripts etc. With a little pain and a few tries and failures, I got it booting directly with EFI and GRUB2. The only missing piece to get this generally usable is an option in the installer like you have with FreeBSD, and then it will be a piece of cake to get up and running.
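
    For the curious, a very rough outline of the sort of steps involved -- not a tested recipe: /dev/sdX and the rpool/ROOT/default naming are placeholders, and the partitioning, debootstrap/chroot and EFI details are omitted:

    % sudo apt install zfsutils-linux zfs-initramfs
    % sudo zpool create -o ashift=12 -O compression=lz4 -O mountpoint=none -R /mnt rpool /dev/sdX2
    % sudo zfs create -o mountpoint=none rpool/ROOT
    % sudo zfs create -o mountpoint=/ rpool/ROOT/default
    % # ...install or copy the system into /mnt, then from a chroot inside it:
    % grub-install /dev/sdX
    % update-grub        # the kernel command line should end up with root=ZFS=rpool/ROOT/default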

    To be fair though, this isn't all the fault of the Btrfs developers. The number of uninformed fanboys parroting how great it was, and how we should all be using it, belied the reality that those of us who heavily tested it for years discovered to our cost. Not so long ago, the slightest criticism or caution was jumped upon in some quarters as though it were some sort of betrayal. No, it was simple common sense borne of actual, informed, real-world experience with it! Blind faith in it won't make it magically reliable and stop it toasting all your data! I think these people did a great disservice to anyone who followed their advice, particularly if they suffered data loss.

    For anyone interested in trying out ZFS on Linux as a rootfs, here's the dataset layout on my system. Note it also includes a zvol as the swap device (a sketch of how such a zvol is typically created follows the listing).

    % lsb_release -cr
    Release: 16.04
    Codename: xenial
    % sudo zfs list
    NAME                    USED  AVAIL  REFER  MOUNTPOINT
    fdata                   134G   315G    96K  /fdata
    fdata/old-root-backup  8.85G   315G  8.85G  /fdata/old-root-backup
    fdata/rleigh            156M   315G    96K  /fdata/rleigh
    fdata/rleigh/clion      156M   315G   156M  /fdata/rleigh/clion
    fdata/schroot           201M   315G    96K  /fdata/schroot
    fdata/schroot/sid       200M   315G   200M  /fdata/schroot/sid
    fdata/vmware            125G   315G   125G  /fdata/vmware
    rpool                  21.9G  85.7G    96K  none
    rpool/ROOT             7.53G  85.7G    96K  none
    rpool/ROOT/default     7.53G  85.7G  7.25G  /
    rpool/home              308K  85.7G    96K  none
    rpool/home/root         212K  85.7G   132K  /root
    rpool/opt              1.74G  85.7G   475M  /opt
    rpool/opt/steam        1.27G  85.7G  1.27G  /opt/steam
    rpool/swap             8.50G  93.7G   510M  -
    rpool/var              4.06G  85.7G    96K  none
    rpool/var/cache        4.06G  85.7G  4.02G  /var/cache
    rpool/var/log          3.06M  85.7G  2.95M  /var/log
    rpool/var/spool         168K  85.7G   104K  /var/spool
    rpool/var/tmp           200K  85.7G   128K  /var/tmp
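
    For reference, a swap zvol like the rpool/swap dataset above is typically created along these lines (the 8G size and the property choices are illustrative, following the usual ZFS-on-Linux guidance):

    % sudo zfs create -V 8G -b $(getconf PAGESIZE) -o compression=zle \
          -o sync=always -o primarycache=metadata -o com.sun:auto-snapshot=false rpool/swap
    % sudo mkswap -f /dev/zvol/rpool/swap
    % sudo swapon /dev/zvol/rpool/swap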

  • (Score: 3, Interesting) by frojack on Sunday August 07 2016, @06:49PM

    by frojack (1554) on Sunday August 07 2016, @06:49PM (#385018) Journal

    Same here. With the second data loss (slow learner), I kicked BTRFS off my machines.

    It solves a lot of problems that no one knew existed, and tries to be the systemd of file systems.
    You never know how much free space you actually have. And deleting data can actually take more space in unexpected ways as the cows come home to roost (yeah, I mixed that metaphor on purpose).
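
    For what it's worth, finding out what is really free takes the btrfs-specific tools rather than plain df (the mount point is a placeholder):

    % df -h /mnt                          # the generic view, often misleading on btrfs
    % sudo btrfs filesystem df /mnt       # allocation broken down by data/metadata profile
    % sudo btrfs filesystem usage /mnt    # fuller picture: allocated vs. used vs. unallocated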

    It really is a mess, and openSUSE, true to form, pushed it as the default filesystem. For Joe User, there is zero advantage. For the Big Data user, there is every reason to avoid this crap.

    ext4 and xfs for me. (And software raid).

    --
    No, you are mistaken. I've always had this sig.
  • (Score: 2) by darkfeline on Sunday August 07 2016, @09:02PM

    by darkfeline (1030) on Sunday August 07 2016, @09:02PM (#385043) Homepage

    For a different anecdotal viewpoint: I have been using BTRFS (without RAID) for six months now and have had zero issues, with constant writing and deleting of large media files on top of LUKS encryption.
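
    For reference, a plain single-device Btrfs-on-LUKS setup of that sort generally looks something like this (device and mapper names are placeholders):

    % sudo cryptsetup luksFormat /dev/sdX1
    % sudo cryptsetup open /dev/sdX1 media
    % sudo mkfs.btrfs /dev/mapper/media
    % sudo mount /dev/mapper/media /mnt/media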

    --
    Join the SDF Public Access UNIX System today!
    • (Score: 5, Interesting) by rleigh on Sunday August 07 2016, @09:34PM

      by rleigh (4887) on Sunday August 07 2016, @09:34PM (#385051) Homepage

      This is a "relatively safe" mode of operation, since you're not using any of the RAID code, and it's also much better tested in this mode. But the thing with filesystems is that everything is fine when there are no problems: if there are no faults, you're not exercising any of the failure codepaths. The true test is when you have hardware glitches on the disk, faulty connectors, memory errors and so on. It's these failure codepaths, the very parts we absolutely require to work correctly, where Btrfs fails so spectacularly. And having been exposed to a whole bunch of them, I can tell you right now that it's unsuitable for production use if you really care about your data, because relying on Btrfs is like playing Russian roulette!

      Lightweight use will *seem* OK, and for the most part *will* be OK. Start loading it with multiple parallel users and concurrent snapshots, and it will fail *really* quickly: my record for the total time from a clean new filesystem to broken and unbalanced is 18 hours, with a few tens of concurrent snapshots and jobs. That is utterly thrashing the filesystem, so with lighter use the time to failure will be significantly longer: a week, several months or several years. But the fact that it will simply stop working at some undefined future point is absurd. That's not hardware related; it's a massive design and implementation *fail*. I can thrash an ext3, xfs or other "simple" filesystem in a similar manner for *years* with every expectation that it will continue to work absent any hardware problems.

      Filesystems demand perfection in their implementation more than any other piece of software; we entrust them with our data. Btrfs has failed at this, badly, ever since its creation. I've been using it since the start. We hoped it would stabilise. It didn't. It's still not trustworthy, and "no issues for six months" really means little when you look at the shocking flaws in it.
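
      To give a concrete idea of the kind of load being described -- this is only a sketch, not the actual job mix, and the paths, sizes and timings are made up -- parallel writers plus frequent snapshot churn against a subvolume looks roughly like:

      % # /mnt is the top-level btrfs mount, /mnt/vol a subvolume on it
      % for i in $(seq 1 10); do dd if=/dev/urandom of=/mnt/vol/writer-$i bs=1M count=1024 & done
      % sudo btrfs subvolume snapshot -r /mnt/vol /mnt/vol-snap-$(date +%s)   # repeat on a timer
      % sudo btrfs subvolume delete /mnt/vol-snap-*                           # prune periodically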