
posted by Fnord666 on Monday August 28 2017, @10:08AM   Printer-friendly
from the a-bitter-fs dept.

Submitted via IRC for TheMightyBuzzard

SUSE has decided to let the world know it has no plans to step away from the btrfs filesystem, and plans to make it even better.

The company's public display of affection comes after Red Hat decided not to fully support the filesystem in its own Linux.

Losing a place in one of the big three Linux distros isn't a good look for any package even if, as was the case with this decision, Red Hat was never a big contributor or fan of btrfs.

[Matthias G. Eckermann] also hinted at some future directions for the filesystem. "We just start to see the opportunities from subvolume quotas when managing Quality of Service on the storage level" he writes, adding "Compression (already there) combined with Encryption (future) makes btrfs an interesting choice for embedded systems and IoT, as may the full use of send-receive for managing system patches and updates to (Linux based) 'firmware'." ®
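For a rough sense of what those features look like in practice (this is not from the article; the paths, device and host names below are made up), per-subvolume quotas, transparent compression and send/receive on an existing btrfs filesystem go roughly like this:

    mount -o compress=lzo /dev/sdb /srv            # transparent compression via a mount option
    btrfs quota enable /srv                        # turn on per-subvolume (qgroup) accounting
    btrfs subvolume create /srv/appdata
    btrfs qgroup limit 10G /srv/appdata            # cap this subvolume at 10 GiB
    btrfs subvolume snapshot -r /srv/appdata /srv/appdata-v2
    btrfs send /srv/appdata-v2 | ssh device btrfs receive /mnt/updates   # push a read-only snapshot to another box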

Mmmmmm... butter-fs

Source: https://www.theregister.co.uk/2017/08/25/suse_btrfs_defence/


Original Submission

 
  • (Score: 5, Informative) by KiloByte on Monday August 28 2017, @10:58AM (17 children)

    by KiloByte (375) on Monday August 28 2017, @10:58AM (#560151)

    While btrfs violates KISS to a massive degree, even just data checksums alone mean you shouldn't even consider using anything but btrfs or ZFS, unless you don't care for correctness of your data at all or have another means of in-line verification.

    It's like ECC memory, except that disks lie a lot more than memory does. HDDs lie. Controllers lie. Cables lie. SD cards massively lie. eMMC lies.

    In theory, disks are supposed to use complex erasure codes themselves that should make such corruption impossible, but that's theory not practice.

    Again and again I see someone come to IRC, say "btrfs is shit", which after troubleshooting ends with "oh, indeed my machine was overheating" or "memtest86 worked well for hours but after two days it started throwing errors". Btrfs detects errors that ext4 or xfs are oblivious to.

    And checksums are not the only data-safety feature of btrfs. Yes, it's slower, because on HDDs it duplicates metadata by default and never overwrites data in place.

    Or, do you want hourly O(changes) backups on a filesystem that rsync needs half an hour just to stat?
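    A minimal sketch of such an O(changes) backup, assuming /data and /backup are btrfs mount points and last hour's snapshot already exists on both sides (all names are made up):

        btrfs subvolume snapshot -r /data /data/.snapshots/hour-11    # read-only snapshot of the current hour
        btrfs send -p /data/.snapshots/hour-10 /data/.snapshots/hour-11 | btrfs receive /backup/data
        # the -p (parent) incremental send transfers only the blocks changed since the previous snapshot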

    --
    Ceterum censeo systemd esse delendam.
  • (Score: 0) by Anonymous Coward on Monday August 28 2017, @12:29PM (11 children)

    by Anonymous Coward on Monday August 28 2017, @12:29PM (#560188)

    unless you don't care for correctness of your data at all or have another means of in-line verification.

    ... uh

    It's like ECC memory, except that disks lie a lot more than memory does.

    CRC and other checksums don't lie; they have hash collisions. Bigger hashes mean a lower chance of collision. The biggest checksum is a copy of the data itself. That's why RAID exists. I prefer to use a hardware solution like RAID5 and a journaling FS like EXT3/4 rather than a chain of hash fingerprints that, if corrupted themselves, lie about the entire subsequent dataset...

    • (Score: 1, Interesting) by Anonymous Coward on Monday August 28 2017, @12:40PM

      by Anonymous Coward on Monday August 28 2017, @12:40PM (#560194)

      If the simple hash had a collision, would the raid be consulted? Don't think that usually happens.

    • (Score: 3, Interesting) by KiloByte on Monday August 28 2017, @01:00PM (8 children)

      by KiloByte (375) on Monday August 28 2017, @01:00PM (#560209)

      CRC and other checksums don't lie; they have hash collisions.

      Hash collisions are a concern against a malicious human; here we're defending against faulty cable/too hot motherboard/bogus RAM/shit disk firmware/random at-rest corruption/random in-transit corruption/evil faeries shooting magic dust pellets at your disk.

      Bigger hashes mean a lower chance of collision.

      They are also much slower to compute and take more space to store. It's a tradeoff with an obvious choice here: an attacker who can write to the block device can trivially generate bogus data with a valid checksum, thus a bigger hash gives us almost nothing.

      The biggest checksum is a copy of the data itself. That's why RAID exists. I prefer to use a hardware solution like RAID5 and a journaling FS like EXT3/4 rather than a chain of hash fingerprints that, if corrupted themselves, lie about the entire subsequent dataset...

      RAID solves a different, related problem: it helps you recover data once you know it's bad. It doesn't help with noticing the failure (well, you can have a cronjob that reads the entire disk, but by that time you've had a week of using bad data) -- and even worse, without the disk notifying you of an error you don't even know which of the copies is the good one.

      By "lies" in the GP post I meant "gives you bad data while claiming it's good". A disk may or may not notice this problem -- if it doesn't, the RAID has no idea it needs to try the other copy.

      --
      Ceterum censeo systemd esse delendam.
      • (Score: 2, Interesting) by shrewdsheep on Monday August 28 2017, @01:34PM (7 children)

        by shrewdsheep (5215) on Monday August 28 2017, @01:34PM (#560222)

        Hash collisions are a concern against a malicious human; here we're defending against faulty cable/too hot motherboard/bogus RAM/shit disk firmware/random at-rest corruption/random in-transit corruption/evil faeries shooting magic dust pellets at your disk.

        CRC guarantees detection of bit-flips up to a certain number (depending on the size of the CRC). Beyond that, there is a probability of collision, which seems low enough as long as bit flips are independent. In practice, bit-flips are not independent but highly correlated (a whole block going bad, ...; same problem with ECC). My take is to be careful about CRC and complement it with different measures. I do software RAID1 (based on discussions I had here and elsewhere; a rough sketch follows below) and I never overwrite backup drives (they get retired once full to serve as an extra layer of safety).

        On the more philosophical side I am amazed how well computers do work. To allow error-free operation, the probabilities of errors in computation/storage have to be incredibly low (let us say per period of time or per operation). So low, in fact, that I find it hard to believe (say 10^-15 to 10^-18 per operation, be it compute or read/write). There should therefore be errors accumulating in our data, but I could not say I have noticed any.
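        For the software-RAID1 approach above, a rough sketch (device names made up) of creating the array and running md's periodic consistency check:

            mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
            echo check > /sys/block/md0/md/sync_action    # many distros schedule this check monthly
            cat /sys/block/md0/md/mismatch_cnt            # non-zero means the mirrors disagree somewhere
            # md can tell that the two copies differ, but without per-block checksums
            # it cannot tell which copy is the correct one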

        • (Score: 2) by LoRdTAW on Monday August 28 2017, @02:41PM (6 children)

          by LoRdTAW (3755) on Monday August 28 2017, @02:41PM (#560253) Journal

          On the more philosophical side I am amazed how well computers do work.

          Honestly, I sometimes consider these things magic. I know how they work down to the silicon, but I'm still amazed at how far we have come in shrinking them down and speeding them up while packing in features, and they still work >99.9% of the time. And let's not forget the millions of lines of code they jog through on a daily basis to do the most mundane of tasks like giving us the time and weather, playing music, or showing TV.

          Just got my 500 GB NVMe disk and a PCIe adapter. I couldn't help but marvel over the tiny Bic-lighter-sized board. It holds over a million times the amount of data of a 360 KB 5.25" floppy, and 25,000 times as much as the ancient 20 MB hard disk in my first 8086 PC. And the bandwidth is staggering. I ripped a very old 2 GB WD IDE disk the other day using a Pentium II 400: a whopping 2.7 MB/s using dd through netcat to another Linux box. I get 500 MB/s from my SATA SSD to the NVMe disk, and that's not even a third of the NVMe write speed, while reads are twice that fast! Progress certainly is amazing.

          • (Score: 2) by frojack on Monday August 28 2017, @07:10PM (5 children)

            by frojack (1554) on Monday August 28 2017, @07:10PM (#560413) Journal

            I'm more amazed at how FEW actual data errors we actually experience.

            I don't buy the nonsense that everything except btrfs experiences silent data loss and we are none the wiser.
            Programs would crash and burn. The kernel would panic, data corruption would be rampant.

            But all those things are just rare as hell, or caught and handled automatically. KiloByte's dire predictions have the distinctive ring of bullshit to me.

            I've lost more data to btrfs (yup - openSUSE), and had to reinstall entire systems due to btrfs, than with all the other file systems I've ever used. (And yes, I still have some ReiserFS systems in play.)

            --
            No, you are mistaken. I've always had this sig.
            • (Score: 2) by KiloByte on Monday August 28 2017, @07:26PM (2 children)

              by KiloByte (375) on Monday August 28 2017, @07:26PM (#560428)

              Most errors are not noticeable. You'll have a mangled texture in a game, a sub-second series of torn frames in a video, a slight corruption of data somewhere. A bad download you'll retry without investigating the issue, etc.

              On your disk, how much do all executables+libraries take together? It's 4 GB or so. The vast majority of what's on the disk is data.

              Being oblivious to badness makes people happy. I don't want a false sense of happiness.

              --
              Ceterum censeo systemd esse delendam.
              • (Score: 3, Insightful) by frojack on Monday August 28 2017, @10:47PM (1 child)

                by frojack (1554) on Monday August 28 2017, @10:47PM (#560564) Journal

                Easy for you to rev up the FUD; hard for you to produce any actual statistics. CERN tried, but 80% of the errors they found were attributable to a mismatch between WD drives and 3Ware controllers, and others pointed out to them that they could never separate those errors from any of their other detected errors.

                How many times has your BTRFS system detected and/or prevented errors?

                --
                No, you are mistaken. I've always had this sig.
                • (Score: 2) by KiloByte on Tuesday August 29 2017, @07:28PM

                  by KiloByte (375) on Tuesday August 29 2017, @07:28PM (#561043)

                  5 out of 8 disk failures I had in recent months involved a silent error. Four of those had nothing but silent errors, while the fifth had a small number of loud errors plus a few thousand silently corrupted blocks. Of the remaining failures, one was an FTL collapse (also silent, but quite hard to miss), while two were HDDs dying completely.

                  Thus, every single one of the failures that didn't take out the whole disk involved silent data corruption.

                  Only one of those had ext4, and I wasted a good chunk of time investigating what was going on -- random segfaults didn't look like a disk failure in the slightest. The others had btrfs; they were either in RAID or I immediately knew exactly what was lost -- and more crucially, no bad data was ever used or hit the backups.

                  Two days ago, I also had a software failure, by stupidly (and successfully) trying to reproduce a fandango-on-core mm kernel bug (just reverted by Linus) on my primary desktop, with one of the loads being a disk balance. All it took to find the corrupted data was to run a scrub; all ~30 scribbled-upon extents luckily were in old snapshots or a throw-away chroot, so I did not even have to restore anything from backups. The repair did not even require a reboot (not counting the one after the crash, obviously).

                  By "disk" in the above I mean: 1 SD card (arm), 1 eMMC (arm), 1 SSD disk (x86), the rest spinning rust (x86).

                  --
                  Ceterum censeo systemd esse delendam.
            • (Score: 2) by LoRdTAW on Monday August 28 2017, @09:49PM

              by LoRdTAW (3755) on Monday August 28 2017, @09:49PM (#560541) Journal

              That's the magic. Almost no data loss. I've lost more data to dumb mistakes (# dd if=... of=/dev/sdc↵ ... fuuuuuuuuuk that was the wrong disk!) than hardware failure or bit rot. In fact, I think I see less bit rot nowadays than maybe 15+ years ago. I remember dealing with more corrupt compressed archives and video files on older 250GB ATA disks than I do now. Those kinds of files see a lot of sitting and rot away. Does that mean I am more confident? No. Still a bit paranoid and I keep backups.

              As for BTRFS: I have yet to give it a go. I like the idea of a libre file system with all the goodies like built-in RAID and data verification. But without it being blessed as stable and production-ready by the devs, I'm just not interested.

            • (Score: 2) by TheRaven on Tuesday August 29 2017, @08:17AM

              by TheRaven (270) on Tuesday August 29 2017, @08:17AM (#560722) Journal
              One of my colleagues did some work a few years ago on using single-bit flips to escape from a JVM sandbox. The basic idea is that you create a pattern in memory such that there is a high probability that a memory error will let you make the JVM dereference a pointer that allows you to escape. You then try to heat up the memory chips until that happens. For some of the experiments, they tried pointing a hairdryer at the RAM chips to make the bit flips happen faster. The most interesting thing with that part of the experiment was how high you could get the error rates before most software noticed.
              --
              sudo mod me up
    • (Score: 1, Informative) by Anonymous Coward on Monday August 28 2017, @02:09PM

      by Anonymous Coward on Monday August 28 2017, @02:09PM (#560239)

      Properly done, ECC will correct errors itself -- that is what ECC means (error-correcting code) -- and this is just as true for errors in the check part as in the data.

      Yes there is more data to be corrupted, but you gain the ability to correct it, and detect that it needs correcting.

  • (Score: 0, Informative) by Anonymous Coward on Monday August 28 2017, @12:49PM (4 children)

    by Anonymous Coward on Monday August 28 2017, @12:49PM (#560200)

    BTRFS is trash. I still don't trust it.

    Relatively recently - mid-last year - fired up a new Arch build as a backup target, decided to give btrfs a try - most people were saying it was stable enough to try. Formatted, got a bit fancy with multi-disk and resilience, everything looks good, started backing up, dies part way through (Veeam backing up to Samba CIFS says it's lost contact with the target). Machine unresponsive even from iLO. Reboot, everything just backed up is gone (empty volume) along with anything left over in tmp locations being corrupt and full of junk.

    Reformat with a nice stable server distro (CentOS7) -- btrfs is still marked as experimental there, and this is a different kernel. Try a nice simple btrfs volume setup without fanciness: similar results.

    XFS works brilliantly (and I've rarely had issues with ext4). The backup target is outfitted with proper ECC RAM, enterprise SAS HDDs, iLO diagnostics and so on - it's not a dodgy workstation or NAS. Veeam and pretty much anything else I throw at it can validate their chain signatures from beginning to end no worries across terabytes of data sitting for up to several years.

    I could forgive an experimental filesystem being experimental. I could probably forgive some dodgy experimental features on a stable base, if they were clearly marked. I could have lived with it if I could get some useful diagnostics - if the doco was helpful and faults provided useful debugging info. But the documentation for btrfs doesn't clearly mark the parts that (if you search for long enough) you discover aren't recommended or stable, and basic operation of the FS is problematic.

    I really wanted to like btrfs (it was perfect for my use case) but I can't trust it so I won't use it.

    • (Score: 4, Interesting) by KiloByte on Monday August 28 2017, @01:22PM (3 children)

      by KiloByte (375) on Monday August 28 2017, @01:22PM (#560217)

      a nice stable server-distro (CentOS7)

      And here's your problem. They use Red Hat's frankenkernels which are notoriously bogus: they mix a truly ancient base (3.10 for 7, 2.6.32 for 6) with backported features from the bleeding edge. Linux' I/O code moves pretty fast, patches tend to have nasty conflicts even three versions apart -- trying to backport complex code from 4.9 to 3.10 just can't possibly end well. And, despite trying something as risky, Red Hat doesn't even have a single btrfs-specific engineer. No wonder it keeps breaking.

      --
      Ceterum censeo systemd esse delendam.
      • (Score: 0) by Anonymous Coward on Monday August 28 2017, @02:52PM

        by Anonymous Coward on Monday August 28 2017, @02:52PM (#560260)

        I wasn't aware Poettering was in charge of the kernel too...

      • (Score: 3, Informative) by hoeferbe on Monday August 28 2017, @04:06PM (1 child)

        by hoeferbe (4715) on Monday August 28 2017, @04:06PM (#560290)
        KiloByte [soylentnews.org] wrote [soylentnews.org]:

        They use Red Hat's frankenkernels which are notoriously bogus: they mix a truly ancient base (3.10 for 7, 2.6.32 for 6) with backported features from the bleeding edge.

        You have an interesting characterization of Red Hat's compatibility policy [redhat.com]. I believe many of Red Hat's customers appreciate the stability that provides. As can be seen from Red Hat Enterprise Linux's life cycle [redhat.com], new features and significant new hardware support are only provided in the first 5.5 years of its 10-year life cycle (i.e. "Production Phase 1"). This means new features are not being added to RHEL6. They are being added to RHEL7's Linux kernel, but I would not go so far as to say 3.10 (which came out in 2013 and is a "longterm" release [kernel.org]) is "ancient".

        Linux' I/O code moves pretty fast, patches tend to have nasty conflicts even three versions apart -- trying to backport complex code from 4.9 to 3.10 just can't possibly end well.

        And yet Red Hat appears to be successful at it!  ;-)

        Red Hat doesn't even have a single btrfs-specific engineer.

        I found this Hacker News thread on Red Hat's Btrfs notice [ycombinator.com] informative since it came from a former Red Hat employee.

        • (Score: 2) by sjames on Monday August 28 2017, @08:15PM

          by sjames (2882) on Monday August 28 2017, @08:15PM (#560467) Journal

          On the other hand, Debian stable isn't known for moving at breakneck speed either but it supports BTRFS just fine.

          RH's continuing dedication to xfs is a bit puzzling as well. xfs made sense in the '90s on IRIX, back when SGI was doing heavy video work on hardware where disk I/O was barely up to the task. In that environment, software willing to go out of its way to work with the FS to maximize I/O could see substantial performance gains. None of that really matters much now, especially since the most interesting features had a hard requirement on specialized hardware that doesn't exist on most servers today. Further, even with the special hardware, those features were never ported to the Linux implementation anyway.

          COW really looks like the future in file systems. It's early(-ish) days for BTRFS, but it really is a credible choice these days as long as you stay away from raid > 1 and according to some, zlib compression (lzo is fine). I could understand if RH wanted to make a play with ZFS instead of BTRFS, but abandoning all of it to keep pushing XFS makes no sense.
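          A minimal sketch (device names made up) of the kind of setup that advice points at -- RAID1 for both data and metadata, with lzo compression:

              mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc
              mount -o compress=lzo /dev/sdb /srv/data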