martyb writes:
"Remember that one bug that had you tearing your hair out and banging your head against the wall for the longest time? And how it felt when you finally solved it? Here's a chance to share your greatest frustration and triumph with the community.
One that I vividly recall occurred back in the early '90s at a startup that was developing custom PBX hardware and software. There was the current development prototype rack and another rack for us in Quality Assurance (QA). Our shipping deadline for a major client was fast approaching, and the pressure level was high as development released the latest hardware and software for us to test. We soon discovered that our system would not boot up successfully. We were getting all kinds of errors; different errors each time. Development's machine booted just fine, *every* time. We swapped out our hard disks, the power supply, the main processing board, the communications boards, and finally the entire backplane in which all of these were housed. The days passed and the system still failed to boot up successfully and gave us different errors on each reboot.
What could it be? We were all stymied and frustrated as the deadline loomed before us. It was then that I noticed the power strips on each rack into which all the frames and power supplies were plugged. The power strip on the development rack was 12-gauge (i.e. could handle 20 amps) but the one on the QA rack was only 14-gauge (15 amps). The power draw caused by spinning up the drives was just enough to leave the system board under-powered for bootup.
We swapped in a new $10 power strip and it worked perfectly. And we made the deadline, too!
So, fellow Soylents, what have you got? Share your favorite tale of woe and success and finally bask in the glory you deserve."
(Score: 1) by sjames on Sunday March 09 2014, @07:35AM
First one in the mid '80s. We were using a cluster of 4 PCs connected by LANtastic (yes, LANtastic) to sort large (for the time) database indexes. The system would split the unsorted data into chunk files small enough to be held in memory, quick sort each chunk, and then perform a distributed merge using mailbox files to coordinate the sort.
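For anyone who hasn't seen the technique, here is a rough single-machine sketch of the chunk-sort-then-merge idea in Python. It is not the original code: the names `external_sort`, `write_sorted_chunk` and `read_records` are made up for illustration, Python's built-in sort stands in for the quick sort, and a plain k-way merge with `heapq.merge` stands in for the distributed merge coordinated through mailbox files. It also assumes records are text strings with no embedded newlines.

```python
import heapq
import os
import tempfile
from contextlib import ExitStack


def write_sorted_chunk(chunk, path):
    # Sort one memory-sized chunk and store it in its own chunk file.
    chunk.sort()  # built-in sort, standing in for the quick sort
    with open(path, "w") as f:
        for record in chunk:
            f.write(record + "\n")


def read_records(f):
    # Strip the newline so the merge compares records exactly as the chunk sort did.
    for line in f:
        yield line.rstrip("\n")


def external_sort(records, chunk_size):
    """Sort string records while holding at most chunk_size of them per chunk."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        chunk = []

        def flush():
            # Write the current chunk to the next numbered chunk file.
            path = os.path.join(tmp, f"chunk{len(paths)}.txt")
            write_sorted_chunk(chunk, path)
            paths.append(path)

        for record in records:
            chunk.append(record)
            if len(chunk) == chunk_size:
                flush()
                chunk = []
        if chunk:
            flush()

        # k-way merge: repeatedly take the smallest head record among all chunk files.
        with ExitStack() as stack:
            files = [stack.enter_context(open(p)) for p in paths]
            return list(heapq.merge(*(read_records(f) for f in files)))
```

Something like `external_sort(lines, 10000)` would then return the sorted records. The sketch collects the merged output in a list for simplicity; a real index build would stream it to disk instead.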
One day, an index is out of order post sorting. Two entries are transposed. The good news is that everything is deterministic and we still had the input, so it should be easy enough to reproduce. However, after watching those two entries go through the whole process, they come out in perfect order. Running the input again without debugging produces a perfectly ordered index as well. Many re-runs with that dataset and others, including pathological inputs, all come out fine.
Conclusion? Random bit flip in a CPU flag. The bug never happened again.
Next up, debugging LinuxBIOS (now coreboot) on a new motherboard. Serial port isn't coming up and POST codes aren't making it to the PCI bus. After a bit of testing, I find that I can toggle the power light by frobbing a couple of bits I can reach. Devise a blink code to get minimal debugging info until I get the serial port up.
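The post doesn't say what the blink code looked like, so the following is only one plausible encoding, sketched in Python for readability. Everything here is an assumption: the names `blink_pattern` and `play`, the long-blink-for-1 / short-blink-for-0 scheme, and the timing units. Real early-boot firmware would do this in C or assembly, with direct register writes in place of `set_led` and a busy loop in place of `wait`.

```python
def blink_pattern(code, bits=8, short=1, long=3, gap=1):
    """Encode an integer as (led_on, duration_units) steps, most significant bit first."""
    pattern = []
    for i in reversed(range(bits)):
        bit = (code >> i) & 1
        # A long blink means 1, a short blink means 0 (hypothetical scheme).
        pattern.append((True, long if bit else short))
        # Dark gap so consecutive blinks can be told apart by eye.
        pattern.append((False, gap))
    return pattern


def play(pattern, set_led, wait):
    # set_led and wait are supplied by the caller; on hardware they would
    # toggle the reachable bits and spin in a delay loop.
    for led_on, units in pattern:
        set_led(led_on)
        wait(units)
```

The idea is that each checkpoint in the boot path calls something like `play(blink_pattern(checkpoint_id), ...)`, so the last code seen on the light shows how far the boot got.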