"Remember that one bug that had you tearing your hair out and banging your head against the wall for the longest time? And how it felt when you finally solved it? Here's a chance to share your greatest frustration and triumph with the community.
One that I vividly recall occurred back in the early 90's at a startup that was developing custom PBX hardware and software. There was the current development prototype rack and another rack for us in Quality Assurance (QA). Our shipping deadline for a major client was fast approaching, and the pressure level was high as development released the latest hardware and software for us to test. We soon discovered that our system would not boot up successfully. We were getting all kinds of errors; different errors each time. Development's machine booted just fine, *every* time. We swapped out our hard disks, the power supply, the main processing board, the communications boards, and finally the entire backplane in which all of these were housed. The days passed and the system still failed to boot up successfully and gave us different errors on each reboot.
What could it be? We were all stymied and frustrated as the deadline loomed before us. It was then that I noticed the power strips on each rack into which all the frames and power supplies were plugged. The power strip on the development rack was 12-gauge (i.e., rated for 20 amps), but the one on the QA rack was only 14-gauge (15 amps). The current drawn while the drives spun up was just enough to leave the system board underpowered at boot.
We swapped in a new $10 power strip and it worked perfectly. And we made the deadline, too! So, fellow Soylents, what have you got? Share your favorite tale of woe and success and finally bask in the glory you deserve."
From: Trey Harris
Here's a problem that *sounded* impossible... I almost regret posting the story to a wide audience, because it makes a great tale over drinks at a conference. :-) The story is slightly altered in order to protect the guilty, elide over irrelevant and boring details, and generally make the whole thing more entertaining.
I was working in a job running the campus email system some years ago when I got a call from the chairman of the statistics department.
"We're having a problem sending email out of the department."
"What's the problem?" I asked.
"We can't send mail more than 500 miles," the chairman explained.
I choked on my latte. "Come again?"
"We can't send mail farther than 500 miles from here," he repeated. "A little bit more, actually. Call it 520 miles. But no farther."
"Um... Email really doesn't work that way, generally," I said, trying to keep panic out of my voice. One doesn't display panic when speaking to a department chairman, even of a relatively impoverished department like statistics. "What makes you think you can't send mail more than 500 miles?"
"It's not what I *think*," the chairman replied testily. "You see, when we first noticed this happening, a few days ago--"
"You waited a few DAYS?" I interrupted, a tremor tinging my voice. "And you couldn't send email this whole time?"
"We could send email. Just not more than--"
"--500 miles, yes," I finished for him, "I got that. But why didn't you call earlier?"
"Well, we hadn't collected enough data to be sure of what was going on until just now." Right. This is the chairman of *statistics*. "Anyway, I asked one of the geostatisticians to look into it--"
"--yes, and she's produced a map showing the radius within which we can send email to be slightly more than 500 miles. There are a number of destinations within that radius that we can't reach, either, or reach sporadically, but we can never email farther than this radius."
"I see," I said, and put my head in my hands. "When did this start? A few days ago, you said, but did anything change in your systems at that time?"
"Well, the consultant came in and patched our server and rebooted it. But I called him, and he said he didn't touch the mail system."
"Okay, let me take a look, and I'll call you back," I said, scarcely believing that I was playing along. It wasn't April Fool's Day. I tried to remember if someone owed me a practical joke.
I logged into their department's server, and sent a few test mails. This was in the Research Triangle of North Carolina, and a test mail to my own account was delivered without a hitch. Ditto for one sent to Richmond, and Atlanta, and Washington. Another to Princeton (400 miles) worked.
But then I tried to send an email to Memphis (600 miles). It failed. Boston, failed. Detroit, failed. I got out my address book and started trying to narrow this down. New York (420 miles) worked, but Providence (580 miles) failed.
I was beginning to wonder if I had lost my sanity. I tried emailing a friend who lived in North Carolina, but whose ISP was in Seattle. Thankfully, it failed. If the problem had had to do with the geography of the human recipient and not his mail server, I think I would have broken down in tears.
Having established that -- unbelievably -- the problem as reported was true, and repeatable, I took a look at the sendmail.cf file. It looked fairly normal. In fact, it looked familiar.
I diffed it against the sendmail.cf in my home directory. It hadn't been altered -- it was a sendmail.cf I had written. And I was fairly certain I hadn't enabled the "FAIL_MAIL_OVER_500_MILES" option. At a loss, I telnetted into the SMTP port. The server happily responded with a SunOS sendmail banner.
Wait a minute... a SunOS sendmail banner? At the time, Sun was still shipping Sendmail 5 with its operating system, even though Sendmail 8 was fairly mature. Being a good system administrator, I had standardized on Sendmail 8. And also being a good system administrator, I had written a sendmail.cf that used the nice long self-documenting option and variable names available in Sendmail 8 rather than the cryptic punctuation-mark codes that had been used in Sendmail 5.
The pieces fell into place, all at once, and I again choked on the dregs of my now-cold latte. When the consultant had "patched the server," he had apparently upgraded the version of SunOS, and in so doing *downgraded* Sendmail. The upgrade helpfully left the sendmail.cf alone, even though it was now the wrong version.
It so happens that Sendmail 5 -- at least, the version that Sun shipped, which had some tweaks -- could deal with the Sendmail 8 sendmail.cf, as most of the rules had at that point remained unaltered. But the new long configuration options -- those it saw as junk, and skipped. And the sendmail binary had no defaults compiled in for most of these, so, finding no suitable settings in the sendmail.cf file, they were set to zero.
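For readers who want to see the failure mode in miniature, here is a toy sketch of an old parser reading a newer config file. It is not real sendmail.cf syntax: the one-letter codes, the long option names, and the `key=value` format are all invented for illustration. The only behavior taken from the story is that unrecognized options are skipped and that settings with no compiled-in default end up as zero.

```python
# Toy model only. The option codes, option names, and file format below are
# hypothetical; they are not real sendmail.cf syntax.

# The "old" parser only understands short one-letter option codes.
KNOWN_OPTIONS = {"C": "connect_timeout_ms", "Q": "queue_interval_ms"}


def parse_config(lines):
    """Parse key=value lines, skipping any option the parser does not know."""
    # No compiled-in defaults: every setting starts out as zero.
    settings = {name: 0 for name in KNOWN_OPTIONS.values()}
    skipped = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        if key in KNOWN_OPTIONS:
            settings[KNOWN_OPTIONS[key]] = int(value)
        else:
            skipped.append(key)  # seen as junk and ignored, with no error
    return settings, skipped


# A config written with long, self-documenting names for a newer parser.
new_style_config = ["ConnectTimeout=300000", "QueueInterval=1800000"]
settings, skipped = parse_config(new_style_config)
print(settings)  # both settings are still zero
print(skipped)   # the long option names that were ignored
```

Nothing fails loudly here: the file parses, the program starts, and the timeout is silently zero.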
One of the settings that was set to zero was the timeout to connect to the remote SMTP server. Some experimentation established that on this particular machine with its typical load, a zero timeout would abort a connect call in slightly over three milliseconds.
An odd feature of our campus network at the time was that it was 100% switched. An outgoing packet wouldn't incur a router delay until hitting the POP and reaching a router on the far side. So time to connect to a lightly-loaded remote host on a nearby network would actually largely be governed by the speed of light distance to the destination rather than by incidental router delays.
Feeling slightly giddy, I typed into my shell:
$ units
1311 units, 63 prefixes

You have: 3 millilightseconds
You want: miles
"500 miles, or a little bit more."
Trey Harris
--
I'm looking for work. If you need a SAGE Level IV with 10 years Perl, tool development, training, and architecture experience, please email me at firstname.lastname@example.org. I'm willing to relocate for the right opportunity.
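The units conversion in the story can be reproduced with a few lines of arithmetic. This is a minimal sketch that follows the story's own simplification (the distance light covers in three milliseconds) and checks it against the four cities whose distances the story gives. The function and variable names are made up for illustration.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre
METERS_PER_MILE = 1609.344            # international mile


def light_distance_miles(seconds):
    """Distance in miles that light covers in the given number of seconds."""
    return seconds * SPEED_OF_LIGHT_M_PER_S / METERS_PER_MILE


# "3 millilightseconds" is the distance light travels in 3 milliseconds.
limit = light_distance_miles(0.003)
print(round(limit, 1))  # 558.8

# Distances as quoted in the story, in miles from the Research Triangle.
CITIES = {"Princeton": 400, "New York": 420, "Providence": 580, "Memphis": 600}

# A destination is "reachable" in this model if it lies within the limit.
results = {city: miles <= limit for city, miles in CITIES.items()}
for city, ok in results.items():
    print(f"{city}: {'delivered' if ok else 'failed'}")
```

The model puts the cutoff at about 559 miles, which agrees with the test results in the story: Princeton and New York fall inside it, Providence and Memphis outside.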