posted by LaminatorX on Sunday December 21 2014, @02:05AM   Printer-friendly
from the fsking-pid0 dept.

A Debian user recently discovered that systemd provides no way to skip a scheduled fsck during boot:

With init, skipping a scheduled fsck during boot was easy, you just pressed Ctrl+c, it was obvious! Today I was late for an online conference. I got home, turned on my computer, and systemd decided it was time to run fsck on my 1TB hard drive. Ok, I just skip it, right? Well, Ctrl+c does not work, ESC does not work, nothing seems to work. I Googled for an answer on my phone but nothing. So, is there a mysterious set of commands they came up with to skip an fsck or is it yet another flaw?

One user chimed in with a hack to work around the flaw, but it involved specifying an argument on the kernel command line. Another user described this so-called "fix" as being "Pretty damn inconvenient and un-discoverable", while yet another pointed out that the "fix" merely prevents "systemd from running fsck in the first place", and it "does not let you cancel a systemd-initiated boot-time fsck which is already in progress."
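The submission doesn't name the argument, but systemd's fsck units document an fsck.mode= switch on the kernel command line; assuming that is the workaround being discussed, it would look something like this when editing the boot entry at the GRUB prompt (the kernel path and root device below are placeholders, not values from the thread):

    linux /vmlinuz-<version> root=/dev/<rootdev> ro quiet fsck.mode=skip

As the commenters note, this only keeps systemd from starting the check in the first place; it cannot cancel a check that is already running.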

Further investigation showed that this is a known bug with systemd that was first reported in mid-2011, and remains unfixed as of late December 2014. At least one other user has also fallen victim to this bug.

How could a severe bug of this nature even happen in the first place? How can it remain unfixed over three years after it was first reported?

 
  • (Score: 2, Interesting) by tftp on Sunday December 21 2014, @02:29AM

    by tftp (806) on Sunday December 21 2014, @02:29AM (#127877) Homepage

    In the 2000s I was running quite a lot of Linux, and this problem always bothered me. It looked like the computer's needs were placed above the user's. By comparison, Windows never ran a disk check unless the FS was in pretty bad shape. These days I have either servers (Ubuntu LTS) that aren't frequently rebooted, or desktops (Mint) in a VM that never need fsck. Maybe journaling filesystems help here, as the integrity of the FS can be determined without going through terabytes of data. Running the check on a modern HDD may be an hour-long distraction, and the OS most certainly should not run it without a positive confirmation.

  • (Score: 2) by sjames on Sunday December 21 2014, @04:39AM

    by sjames (2882) on Sunday December 21 2014, @04:39AM (#127922) Journal

    How is it that you KNOW that no bit got flipped anywhere on the drive?

    Out of an abundance of caution, Linux wants to do a full fsck periodically. If you don't want that, you can disable the periodic fsck and run it manually at a time of your choosing. Up until systemd reared its ugly head, you also had the ability to cancel the fsck if you wanted.

    • (Score: 1) by tftp on Sunday December 21 2014, @05:41AM

      by tftp (806) on Sunday December 21 2014, @05:41AM (#127931) Homepage

      How is it that you KNOW that no bit got flipped anywhere on the drive?

      Drives don't flip bits for no reason. Every data record (sector) has a checksum. If the sector reads OK but a bit is flipped, it's because you wrote it flipped. Don't do that.

      One of the primary reasons fsck was needed in the days of ext2 was the absence of a journal, which meant that an abrupt reset of the computer could wreck the FS. The same was the norm in the days of Windows 95 (FAT). However, ext3/ext4 on the Linux side (not even mentioning other journaling filesystems) and NTFS on Windows reduced the need for a full scan, because everything one needs to know for recovery is, generally, in the journal. Forcing a check of the filesystem every so many boots just demonstrates either extreme paranoia or a lack of trust in the FS code. Perhaps back then, 15 years ago, it was a reasonable measure, considering the state of the filesystems in Linux; it is strange to encounter such a thing today. How often does a modern Windows box force you to rescan the HDD? IIRC, it happens only if an error is detected during normal operation - and then the rescan is scheduled for the next reboot. And yes, you can cancel it :-)

      • (Score: 2) by sjames on Sunday December 21 2014, @06:37AM

        by sjames (2882) on Sunday December 21 2014, @06:37AM (#127940) Journal

        They don't flip for no reason; they flip for a variety of reasons: a spike in vibration while writing an adjacent track, EMI in the drive cable, a stray cosmic ray, power glitches, etc.

        In SOME of those cases the checksum doesn't match, but you won't know that unless something like a full fsck comes along and detects it for you. In others, the checksum will have been generated AFTER the data got corrupted and so it will perfectly validate the incorrect metadata. Fsck can sanity check that for you.

        Journaling in the file system is a great advancement, but it isn't a panacea. Paranoid? Perhaps, but we're talking about servers, not glorified solitaire and minesweeper machines :-)

        But Linux does respect user choice. You can use the -c and -i options of tune2fs to change the fsck interval or disable the periodic fsck entirely. As long as you avoid systemd, you can cancel an fsck if it's not a good time, or run it manually at a time of your choosing, which also resets the countdown.
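        For reference, a minimal sketch of what that looks like on an ext3/ext4 volume (the device name below is a placeholder; point it at your actual partition, and run the manual check on an unmounted filesystem):

            tune2fs -l /dev/sdXN | grep -Ei 'mount count|check'   # show the current schedule
            tune2fs -c 0 -i 0 /dev/sdXN                           # disable the periodic check entirely
            e2fsck -f /dev/sdXN                                   # or check it yourself now; this also resets the counters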

        • (Score: 1) by tftp on Sunday December 21 2014, @06:55AM

          by tftp (806) on Sunday December 21 2014, @06:55AM (#127943) Homepage

          Paranoid? Perhaps, but we're talking about servers, not glorified solitaire and minesweeper machines :-)

          I agree about servers; but I recall that the darned thing activated when I powered up the laptop at a meeting :-) Don't even remember what it was, SUSE or RedHat.

          In others, the checksum will have been generated AFTER the data got corrupted and so it will perfectly validate the incorrect metadata.

          Checksums are calculated and checked by the hardware; the data rate is way too high to do it in the CPU. Besides, it has to be done atomically, at the sector level, when the HDD rewrites a sector. But sure, there is always a possibility to screw things up. It may be reasonable to be extra careful on servers. However, desktops are just fine with lazy verification.

          • (Score: 2) by sjames on Sunday December 21 2014, @09:17AM

            by sjames (2882) on Sunday December 21 2014, @09:17AM (#127973) Journal

            I agree about servers; but I recall that the darned thing activated when I powered up the laptop at a meeting :-) Don't even remember what it was, SUSE or RedHat.

            An excellent example of why fsck might need to be cancelled.

            You know which use case best applies to your machine, so you must configure it if you want lazy verification.

      • (Score: 2) by Immerman on Sunday December 21 2014, @06:45AM

        by Immerman (3985) on Sunday December 21 2014, @06:45AM (#127941)

        >Drives don't flip bits for no reason.

        Of course they do - hence the name "random error". Back in the day you could expect an error, on average, in one bit out of every 10^14 bits read - roughly one flipped bit per ten-odd terabytes, not so bad back when that was a ferocious amount of data transfer for a 100MB drive at 50mb/s. Today, though, it probably rears its head a few times in the lifetime of a 1TB drive. And that's traditional, simple, even crude HD technology. Newer tech... well, a lot of it hasn't even been out long enough to determine realistic real-world error rates. To say nothing of SSDs - where I just found numbers in the 1 in 10^8 range. That's 1 bit in every 100MB, or one error every few seconds if you're saturating the SATA bus. Seems ridiculous to me, so presumably there's a lot of error correction going on behind the scenes, but even that has its limits.

        • (Score: 3, Interesting) by sjames on Sunday December 21 2014, @07:44AM

          by sjames (2882) on Sunday December 21 2014, @07:44AM (#127952) Journal

          A simple read error is not a big deal, except sometimes for performance: just read again until the checksum verifies the data. Often that happens at the hardware level. Write errors or bits flipped on the media ARE a big deal, since that is actually corrupt data. Worst of all is when the flipped bit happens prior to checksumming. That can cause silent corruption.

          That's why BTRFS and ZFS do checksumming at the file system level and support raid-like storage. Unlike a device level RAID, they can decide which disk has the correct data in cases of bit flip and can then re-write the correct data to the other drive. Even on a single drive, btrfs likes to write two copies of the metadata.
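          To make that concrete, a rough sketch of the btrfs side (the mount point and device are placeholders, not anything from this thread):

              btrfs scrub start /mnt/data            # walk all checksums in the background
              btrfs scrub status /mnt/data           # report progress and any corrected errors
              mkfs.btrfs -m dup -d single /dev/sdX   # single disk: keep two copies of metadata ("dup") explicitly

          ZFS exposes the same idea as "zpool scrub <pool>", mentioned elsewhere in this discussion.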

        • (Score: 2) by FatPhil on Sunday December 21 2014, @08:44AM

          by FatPhil (863) <reversethis-{if.fdsa} {ta} {tnelyos-cp}> on Sunday December 21 2014, @08:44AM (#127965) Homepage
          1TB is ~10^13 bits. A 10^-14 error will happen every tenth time you do a full backup. If hard disks get much larger they're going to have to start incorporating much fancier error detection/correction.
          --
          Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
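          (Back-of-envelope check of the figure above, taking a 1TB drive as 8 x 10^12 bits and the error rate as one bit per 10^14 bits read:

              echo '8 * 10^12 / 10^14' | bc -l    # -> 0.08 expected errors per full read

          i.e. roughly one error every ten-odd full passes over the drive, which is the parent's "every tenth time you do a full backup" to within rounding.)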
        • (Score: 2) by cafebabe on Thursday December 25 2014, @08:33AM

          by cafebabe (894) on Thursday December 25 2014, @08:33AM (#129060) Journal

          A similar problem occurs with networking. For a few packets over a few segments, 16-bit checksums may be sufficient. However, 0.01% corruption of Jumbo Frames over 13 hops leads to silent corruption approximately once per hour. Worse links or small packets may lead to significantly higher rates of corruption.

          presumably there's a lot of error correction going on behind the scenes, but even that has its limits.

          Unfortunately not. One person's payload is another person's header. So, if you aren't processing the payload immediately, the corruption is silent. Even if you are processing the payload immediately, corruption may elude validation or parsing.

          --
          1702845791×2
      • (Score: 2) by FatPhil on Sunday December 21 2014, @08:27AM

        by FatPhil (863) <reversethis-{if.fdsa} {ta} {tnelyos-cp}> on Sunday December 21 2014, @08:27AM (#127960) Homepage
        You appear not to consider a sector failing to read to be a problem.

        In that case - turn off filesystem checking completely. Good luck, and enjoy your bitrot.
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
      • (Score: 2) by cafebabe on Thursday December 25 2014, @08:31AM

        by cafebabe (894) on Thursday December 25 2014, @08:31AM (#129059) Journal

        Drives don't flip bits for no reason. Every data record (sector) has a checksum.

        That isn't an end-to-end checksum. Also, ATA drives autonomously substitute sectors when regenerative checksums exceed a threshold. What algorithm and threshold? That is proprietary and varies with each revision of firmware on each model of each manufacturer's drives.
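        The raw counters are at least visible through SMART, even if the firmware's remapping policy isn't; for example (device name is a placeholder):

            smartctl -A /dev/sda | grep -i -e reallocated -e pending

        which shows the Reallocated_Sector_Ct and Current_Pending_Sector attributes on a typical ATA drive.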

        How often does a modern Windows box force you to rescan the HDD?

        After every Blue Screen Of Death.

        --
        1702845791×2
    • (Score: 0) by Anonymous Coward on Sunday December 21 2014, @09:53PM

      by Anonymous Coward on Sunday December 21 2014, @09:53PM (#128134)
      Irrelevant. If you're using periodic fscks on (rare) reboots to detect bits flipped on your drives, you're doing things badly wrong. fsck checks for file system integrity. Not bits flipped anywhere on the drive.

      FWIW extensive checking for file system integrity during reboots just because X weeks have passed is also a stupid thing to be doing.

      If there is a reason that you need to check your filesystem for problems every X weeks or months, you shouldn't be waiting for a reboot after Y months to do so. There's no strong correlation between X and Y.
      • (Score: 2) by sjames on Monday December 22 2014, @01:17AM

        by sjames (2882) on Monday December 22 2014, @01:17AM (#128190) Journal

        Bits flipped in the metadata definitely affect filesystem integrity. The whole scheme is a bit of a holdover from a previous era. The trend now is towards filesystem-level checksums and online integrity checking (beyond journaling).

        • (Score: 0) by Anonymous Coward on Monday December 22 2014, @12:35PM

          by Anonymous Coward on Monday December 22 2014, @12:35PM (#128296)
          You said "How is it that you KNOW that no bit got flipped anywhere on the drive?"

          Not "How is it that you know that no bit got flipped in the filesystem METADATA on the drive"

          Big difference.
          • (Score: 2) by sjames on Tuesday December 23 2014, @09:15AM

            by sjames (2882) on Tuesday December 23 2014, @09:15AM (#128619) Journal

            It would only matter to your argument if I had said "How is it that you know that no bit got flipped in the filesystem anywhere but in the metadata". I didn't. The whole disk includes the metadata, yes?

    • (Score: 2) by Geotti on Monday December 22 2014, @02:07AM

      by Geotti (1146) on Monday December 22 2014, @02:07AM (#128199) Journal

      How is it that you KNOW that no bit got flipped anywhere on the drive?

      There's like 10 posts in this tree, but no one suggested ZFS to limit the repercussions of a flipped bit. Makes me wonder...

      • (Score: 2) by Geotti on Monday December 22 2014, @02:09AM

        by Geotti (1146) on Monday December 22 2014, @02:09AM (#128200) Journal

        Whoops, I overlooked http://soylentnews.org/comments.pl?sid=5378&cid=127952 [soylentnews.org] by sjames. *blush*

      • (Score: 0) by Anonymous Coward on Monday December 22 2014, @04:21AM

        by Anonymous Coward on Monday December 22 2014, @04:21AM (#128223)

        ZFS is a pain in the ass to use with Linux. It's not like Solaris or FreeBSD, where it's pretty much seamless.

      • (Score: 2) by cafebabe on Thursday December 25 2014, @02:54PM

        by cafebabe (894) on Thursday December 25 2014, @02:54PM (#129105) Journal

        The gains from using ZFS on one drive are fairly small. Yes, you'll have end-to-end checksums. However, it may be preferable to suffer a default filesystem and a periodic check.

        --
        1702845791×2
      • (Score: 2) by cafebabe on Monday December 29 2014, @06:04PM

        by cafebabe (894) on Monday December 29 2014, @06:04PM (#130001) Journal

        Not many people are running software RAID, and ZFS isn't a huge gain with a single volume. Admittedly, ZFS offers end-to-end checksums, but it seems people would rather put up with occasional integrity checks than incur upgrade, compatibility or recovery problems from something more transparent.

        --
        1702845791×2
  • (Score: 3, Insightful) by VLM on Sunday December 21 2014, @12:41PM

    by VLM (445) on Sunday December 21 2014, @12:41PM (#127999)

    Running the check on a modern HDD may be an hour-long distraction

    The "linux people" have moved on to freebsd, where "zpool scrub zroot" runs in the background. Also on my SSD desktop with only a couple gigs of stuff, a scrub only takes a minute or so. We'll call it 20 seconds per gig. "Modern HDD" is an oxymoron other than multi-terabyte beasts that live in a fileserver at home or the NAS at work. I would have to check but I think I'm down to 4 spinning rust drives at home, 3 multi-terabyte beasts in the fileserver and one in the kids xbox.

    The "enterprise java windows developers" have all taken over linux now and they're spreading their views over linux rather than merging into the community. So you get folks who think the design of systemd is a great idea, etc. We're stuck with windows ME edition thinking, running on a otherwise decent linux kernel.