"Remember that one bug that had you tearing your hair out and banging your head against the wall for the longest time? And how it felt when you finally solved it? Here's a chance to share your greatest frustration and triumph with the community.
One that I vividly recall occurred back in the early '90s at a startup that was developing custom PBX hardware and software. There was the current development prototype rack and another rack for us in Quality Assurance (QA). Our shipping deadline for a major client was fast approaching, and the pressure level was high as development released the latest hardware and software for us to test. We soon discovered that our system would not boot up successfully. We were getting all kinds of errors; different errors each time. Development's machine booted just fine, *every* time. We swapped out our hard disks, the power supply, the main processing board, the communications boards, and finally the entire backplane in which all of these were housed. The days passed, and the system still failed to boot, giving us different errors on each attempt.
What could it be? We were all stymied and frustrated as the deadline loomed before us. It was then that I noticed the power strips on each rack into which all the frames and power supplies were plugged. The power strip on the dev server was 12-gauge (i.e., rated for 20 amps), but the one on the QA rack was only 14-gauge (rated for 15 amps). The power draw from spinning up the drives was just enough to leave the system board under-powered during bootup.
We swapped in a new $10 power strip and it worked perfectly. And we made the deadline, too! So, fellow Soylents, what have you got? Share your favorite tale of woe and success and finally bask in the glory you deserve."
Due to its intermittent nature, this one took a few days to get to grips with. One of my customers is a small/medium-sized business. A few years ago their server would go off the air sometime after 11am, but possibly as late as close-of-business, almost daily. After eliminating the server hardware and software and the network switch as the problem, the only thing left was the cable in the wall. I laid an untidy 20 m Cat 5 cable between rooms, and sure enough the problem seemed to go away. They brought in a cable guy who found the interesting truth: the drain hole of an air-conditioning unit was partially blocked, so after a few hours, depending on humidity, the aircon would start dripping into the wall space directly onto the network plug.
I have similar software experiences, but they don't really make good stories. Probably my favourite story isn't a bug and barely qualifies as a hack, and it's quite old. I was dealing with hardware that didn't allow network booting/provisioning because it had no boot ROM. I got around this by adding a small boot partition containing a network-enabled Grub (with a software boot ROM) which pulled its configuration from a network location. I usually had the machines booting locally, but I could also tell individual machines to image themselves... I also had the machines default to booting locally if the network failed. This was NOT the way Grub network support was intended to be used... it was basically meant to allow a network-capable Grub boot menu which had already been bootstrapped via PXE. I guess I like it because it saved me a LOT of work. BTW, is there any hardware these days that doesn't come with network boot code? Probably Raspberry Pi at least, but they're ARM-based and Grub wouldn't run on them.
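The local-boot-with-network-override trick described above can be sketched as a GRUB configuration. This is a hypothetical reconstruction in modern GRUB 2 syntax (the original hack predates GRUB 2), and the server address, file paths, and partition numbers are all assumptions, not details from the story:

```
# /boot/grub/grub.cfg on the small local boot partition (hypothetical sketch)
# Try to fetch a per-machine config from the network; fall back to local boot.

insmod net
insmod tftp

# net_bootp requests an address via DHCP; if that succeeds, pull a remote
# config named after this machine's MAC address (server IP is an assumption)
if net_bootp; then
    source (tftp,192.168.0.1)/grub/${net_default_mac}.cfg
fi

# Reached when the network is down or the remote config didn't boot anything
menuentry "Local boot" {
    set root=(hd0,2)
    linux /vmlinuz root=/dev/sda2
    initrd /initrd.img
}
```

The remote config can either do nothing (so the machine falls through to booting locally) or chain-load an installer/imaging environment for that one machine, which matches the "tell individual machines to image themselves" behaviour described above.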