"Remember that one bug that had you tearing your hair out and banging your head against the wall for the longest time? And how it felt when you finally solved it? Here's a chance to share your greatest frustration and triumph with the community.
One that I vividly recall occurred back in the early 90's at a startup that was developing custom PBX hardware and software. There was the current development prototype rack and another rack for us in Quality Assurance (QA). Our shipping deadline for a major client was fast approaching, and the pressure level was high as development released the latest hardware and software for us to test. We soon discovered that our system would not boot up successfully. We were getting all kinds of errors; different errors each time. Development's machine booted just fine, *every* time. We swapped out our hard disks, the power supply, the main processing board, the communications boards, and finally the entire backplane in which all of these were housed. The days passed and the system still failed to boot up successfully and gave us different errors on each reboot.
What could it be? We were all stymied and frustrated as the deadline loomed before us. It was then that I noticed the power strips on each rack, into which all the frames and power supplies were plugged. The power strip on the dev rack was 12-gauge (i.e. could handle 20 amps), but the one on the QA rack was only 14-gauge (15 amps). The power draw caused by spinning up the drives was just enough to leave the system board under-powered for bootup.
We swapped in a new $10 power strip and it worked perfectly. And we made the deadline, too! So, fellow Soylents, what have you got? Share your favorite tale of woe and success and finally bask in the glory you deserve."
In the 90s, I was responsible for stabilizing a roughly 250 kLOC MacApp 3 application for medical device control. A few crashers were quite hard to nail down. The one I'm most proud of, I guess, was a bug in Script Manager. Under certain circumstances, we saw seemingly random memory get overwritten. I narrowed it down to a specific call and drilled down into the OS assembly with MacsBug. It turned out that Script Manager wrote to an address held in a register it had not set up beforehand. I found a workaround and reported it. Got me a very nice reply from DTS for the "hard work in MacsBug" :)
Another hard one, where I can only claim an "assist", involved SetCursor(), which was "guaranteed" to be interrupt-safe; given that guarantee, MacApp switched on the watch cursor from within an interrupt. However, the Control Strip patched SetCursor and went on to do interrupt-unsafe memory handling. This led to very rare crashes. Some back and forth with DTS led to nothing. Then I noticed that some addresses pointed to a pattern that looked suspiciously like a bitmap, and after drawing it out by hand, I figured out that it was one of the mouse cursors. With that info, I reported back again, and after a few days DTS responded that they had found the offender.
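For anyone who wants to try the "draw the bytes out by hand" trick without graph paper, here is a minimal Python sketch. It assumes the 16 by 16, one bit per pixel layout of classic Mac OS cursor images, most significant bit leftmost; the function name `render_bitmap` and the sample bytes are made up for illustration and are not the actual data from that crash.

```python
def render_bitmap(data: bytes, width_bits: int = 16) -> list[str]:
    """Render raw bytes as text rows, '#' for set bits and '.' for clear bits.

    Assumes 1 bit per pixel, rows packed back to back, MSB is the leftmost pixel.
    """
    if width_bits <= 0 or width_bits % 8 != 0:
        raise ValueError("width_bits must be a positive multiple of 8")
    row_bytes = width_bits // 8
    rows = []
    # Walk the buffer one full row at a time; a trailing partial row is ignored.
    for offset in range(0, len(data) - row_bytes + 1, row_bytes):
        value = int.from_bytes(data[offset:offset + row_bytes], "big")
        bits = format(value, f"0{width_bits}b")
        rows.append(bits.replace("1", "#").replace("0", "."))
    return rows


if __name__ == "__main__":
    # Hypothetical memory dump: 8 rows of a 16-pixel-wide pattern.
    suspicious = bytes.fromhex("8000c000e000f000f800fc00fe00ff00")
    print("\n".join(render_bitmap(suspicious)))
```

If the dump really is image data, the shape tends to jump out as soon as the row width is right, so it is worth trying a few values of `width_bits`.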
Honourable mention goes to a Linux kernel (2.4.2x) issue where jffs2 would ignore a read-only mount flag, which we noticed when checksums for a supposed-to-be-read-only boot file system changed. Tracking the bug down wasn't that hard, and the maintainer (David Woodhouse) went for a slightly broader solution than my mailed-in patch, but I got credit in the kernel change log :)
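The checksum check that exposed this is easy to reproduce. The following is only a rough Python sketch of the idea, not the tooling we used back then: the helper names `checksum_tree` and `changed_files` are hypothetical, and SHA-256 is just a convenient choice of hash.

```python
import hashlib
from pathlib import Path


def checksum_tree(root: str) -> dict[str, str]:
    """Map each regular file under root (by relative path) to its SHA-256 hex digest."""
    base = Path(root)
    digests = {}
    for path in sorted(base.rglob("*")):
        if path.is_file():
            digests[str(path.relative_to(base))] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return the sorted names of files that were added, removed, or modified."""
    names = set(before) | set(after)
    return sorted(name for name in names if before.get(name) != after.get(name))
```

Take one snapshot with `checksum_tree` right after mounting and another one later; on a file system that is really read-only, `changed_files` on the two snapshots should come back empty.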
And right now, I've got my teeth sunk into what seems like a cache coherency / TLB integrity issue with a soft-core CPU (MicroBlaze). It seems to read a wrong value from a SYSV shared memory area about once every 5 minutes, and only in a very specific application software setup. If my analysis so far is correct (I can _most_ likely rule out a race) and I get to fix it, that will definitely make this list :)