"Remember that one bug that had you tearing your hair out and banging your head against the wall for the longest time? And how it felt when you finally solved it? Here's a chance to share your greatest frustration and triumph with the community.
One that I vividly recall occurred back in the early '90s at a startup that was developing custom PBX hardware and software. There was the current development prototype rack and another rack for us in Quality Assurance (QA). Our shipping deadline for a major client was fast approaching, and the pressure level was high as development released the latest hardware and software for us to test. We soon discovered that our system would not boot up successfully. We were getting all kinds of errors, and different errors each time. Development's machine booted just fine, *every* time. We swapped out our hard disks, the power supply, the main processing board, the communications boards, and finally the entire backplane in which all of these were housed. The days passed and the system still failed to boot up successfully, giving us different errors on each reboot.
What could it be? We were all stymied and frustrated as the deadline loomed before us. It was then that I noticed the power strips on each rack, into which all the frames and power supplies were plugged. The power strip on the development rack was 12-gauge (i.e., it could handle 20 amps), but the one on the QA rack was only 14-gauge (15 amps). The power draw caused by spinning up the drives was just enough to leave the system board under-powered during bootup.
We swapped in a new $10 power strip and it worked perfectly. And we made the deadline, too! So, fellow Soylents, what have you got? Share your favorite tale of woe and success and finally bask in the glory you deserve."
In the 1990s, I worked for a vendor that made servers, large and small, and quite a few incidents were big wins ...
1. A client complained that their UNIX System V.3 machine had stopped accepting any kind of data entry, and raised hell with the manager. So I was called to go and investigate on site. They restored the backup and started entering the invoices into the system (it was an RM/COBOL application). When the system stopped again, I found that a data file was exactly 32MB (or some round number like that). When I looked at their shell session, ulimit was set to that exact number, to prevent runaway processes from eating up disk space. I changed the limit in their shell .profile or somesuch, and it worked on the first try. The client was so thankful that he sent a glowing thank-you letter to the world headquarters of the company.
2. Another client (a bank) that had just converted from the old UNIX System V.3 to UNIX SVR4 complained that CPU utilization spiked mid-day, around 11 am, when customers rushed in for transactions. I used truss [idevelopment.info] (the SVR4 equivalent of strace [wikipedia.org]) to monitor their application (also written in RM/COBOL). I found that a call kept returning busy because a file was locked. Their application, instead of returning an error to the user, went into a loop retrying the operation, only to find the file still locked and retry again, eating up CPU time. I informed the developers of the issue, and they changed the code to sleep for a second before retrying, and the problem went away.
3. A client had a Decision Support System (DSS) written in Visual Basic, querying their data warehouse server that ran Teradata [wikipedia.org], which is a massively parallel database just for DSS apps. The VB app allowed them to run ad-hoc queries against a very large data set. When the app was launched, they found that a crucial query was taking hours and never coming back. When I investigated, it turned out that Microsoft's Jet Engine database layer wanted to pull the entire tables into the Windows PC's memory, then do the database JOIN locally. The problem: the tables were too big, and the connection was a 64kbps leased line! The solution I devised was to ditch Microsoft's Jet Engine and go directly to the Teradata ODBC layer. That way, the JOINs were done inside the database, as they should be, and the query took minutes to execute.
4. There was another bug whose details I no longer remember, but I had to troubleshoot it remotely over the phone with someone who did not speak English. He read the English output back to me by describing the shapes of the letters, and I instructed him to press the Arabic keys that carried the letters I wanted typed (e.g. ls -l). To my surprise, it worked, and I was lucky that we both had keyboards with the same Arabic layout, something rare in those days.
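For anyone curious what the fix in story 2 looks like in practice, here is a rough sketch in Python of "sleep between lock retries instead of spinning". To be clear, this is not the bank's code (that was RM/COBOL); the function name acquire_with_sleep, the try_lock callable, and the 30-attempt cap are all made up for illustration. Only the one-second sleep comes from the story.

```python
import time
from typing import Callable


def acquire_with_sleep(
    try_lock: Callable[[], bool],
    delay_seconds: float = 1.0,
    max_attempts: int = 30,  # hypothetical cap, not from the original story
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call try_lock until it succeeds, sleeping between failed attempts.

    try_lock is a hypothetical stand-in for "attempt the locked file
    operation"; it returns True on success and False while the file is busy.
    Returns the number of attempts used, or raises TimeoutError if the
    lock is still busy after max_attempts tries.
    """
    for attempt in range(1, max_attempts + 1):
        if try_lock():
            return attempt
        if attempt < max_attempts:
            # Give up the CPU instead of hammering the locked file in a tight loop.
            sleep(delay_seconds)
    raise TimeoutError(f"lock still busy after {max_attempts} attempts")


if __name__ == "__main__":
    # Simulated lock that is busy for the first two attempts.
    outcomes = iter([False, False, True])
    naps = []
    attempts = acquire_with_sleep(lambda: next(outcomes), sleep=naps.append)
    print(attempts, naps)  # prints: 3 [1.0, 1.0]
```

The sleep function is passed in only so the example can run without actually waiting; in real use you would leave it at the default. Capping the attempts also means the user eventually gets an error back instead of a hung screen, which is arguably what the original application should have done in the first place.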