"Remember that one bug that had you tearing your hair out and banging your head against the wall for the longest time? And how it felt when you finally solved it? Here's a chance to share your greatest frustration and triumph with the community.
One that I vividly recall occurred back in the early 90's at a startup that was developing custom PBX hardware and software. There was the current development prototype rack and another rack for us in Quality Assurance (QA). Our shipping deadline for a major client was fast approaching, and the pressure level was high as development released the latest hardware and software for us to test. We soon discovered that our system would not boot up successfully. We were getting all kinds of errors; different errors each time. Development's machine booted just fine, *every* time. We swapped out our hard disks, the power supply, the main processing board, the communications boards, and finally the entire backplane in which all of these were housed. The days passed and the system still failed to boot up successfully and gave us different errors on each reboot.
What could it be? We were all stymied and frustrated as the deadline loomed before us. It was then that I noticed the power strips on each rack into which all the frames and power supplies were plugged. The power strip on the dev server was 12-gauge (i.e. could handle 20 amps) but the one on the QA rack was only 14-gauge (15 amps). The power draw caused by spinning up the drives was just enough to leave the system board under-powered for bootup.
We swapped in a new $10 power strip and it worked perfectly. And we made the deadline, too! So, fellow Soylents, what have you got? Share your favorite tale of woe and success and finally bask in the glory you deserve."
If that power strip was so close to its limit that it couldn't carry the load, but the built-in circuit breaker didn't trip, you are lucky to be rid of it.
Although I've had some fun with bugs. One of the earliest I remember was from when I was learning to program in C. I had a program with a for loop that for some reason seemed to execute only one iteration, even though the control variable ended up with the proper value, as if the entire loop had run. After a long time I noticed that I had put a ';' beside the for, which completely changed the meaning of the code while still compiling.
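For anyone who hasn't been bitten by this one yet, here is a minimal sketch of the stray-semicolon loop. The function names are made up for illustration; the original program isn't shown in the post.

```c
/* Buggy: the ';' right after the for header is the entire loop body.
   The braces below are just an ordinary block that runs once, after
   the loop has already counted i all the way up to n. */
int count_body_runs_buggy(int n)
{
    int i, runs = 0;
    for (i = 0; i < n; i++);
    {
        runs++;
    }
    return runs;
}

/* Fixed: without the stray ';' the block is the loop body again. */
int count_body_runs_fixed(int n)
{
    int i, runs = 0;
    for (i = 0; i < n; i++) {
        runs++;
    }
    return runs;
}
```

With n = 5 the buggy version reports one run of the body and the fixed version reports five, which matches the "only one iteration, but the control variable looks right" symptom.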
I also once tried to compile something that made GCC throw an error and ask me to file a bug report with Debian, but I was too frustrated with work and deadlines so I threw it away and went for a walk. I still wonder what the cause of the error was.
Another that I just remembered: I was writing a program to solve a problem for a programming competition in a training session. I don't remember specifics, but it had something to do with dividing area, perhaps it could be shared. I and a colleague tried to use some kind of "marks" to separate the groups, but the "marks" always had some problem that made them similar enough to cause errors. Then we decided to call rand() and the program simply worked.
I had a program with a for loop that for some reason seemed to execute only one iteration, [...] I had put a ';' beside the for
This type of error (and wrongly nested if/else constructs) I stopped making when I started using an editor* that was syntax aware and would show from the auto-indenting that something was different than I wanted.
*Emacs in my case, but surely there is a way to make vi/vim do the same. (ducking)
Yeah, I had just started learning, so I was using something like Crimson Editor. Now I have converted to Vim and its light shines through the code, allowing me to avoid those pitfalls. But Emacs is OK, I just don't like using Control and Meta.
some of the worst bugs i've battled have involved many wasted hours chasing after dead ends, but after a break and looking at the problem with fresh eyes i come away thinking "that was so simple and stupid, why didn't i see that at the start"
True. A break from the problem can even be - going to sleep. It's pretty amazing what you can do after a few zzz's.
One time, eons ago, I was driving over the Bay Bridge toward SF when my car stalled right after passing that middle island point. Tried to restart. No go. Cars behind me backed up. Started honking. I got super pissed, too.
Got out of the car. Kicked the tire really hard. Gave it a real mean look. Because she knows I have a baseball bat in the back hatch.
Got back in. Gave it a start. It started on the second try.
Then I drove home.
The moral of the story? Shit if I know. Shit happens? Don't drive behind a clunker?
Sounds like vapor lock.
Whatever that is, you might be right if it has to do with carb and/or vacuum leak.
> ...vapor lock
Or carb (inlet) icing? Was it a hot day? Or a cool muggy day?
Considering (s)he said "Bay Bridge toward SF", I'd vote for cool and muggy. Sounds like vapor lock to me as well.
Doctor: "Do you hear voices?"
Me: "Only when my bluetooth is charged."
In 2001 I was working support for the regional Mastercam reseller, and was given the duty of providing a customer with a customized "rolldie" post processor for use in the machining of rolling dies. The customer kept complaining that the geometry was coming out wrong. I took a fine tooth comb to the post processor and found nothing wrong, so I trigged it all out manually and found the results matched perfectly. My boss was hesitant to tell the customer they were wrong, so I told him I'd bet my balls the post was correct. "That's pretty confident" he said, and told the customer. They made the expensive service call to have their machine checked out, and it turned out their servo drivers were mis-tuned. After re-tuning (acceleration parameters and such) everything worked perfectly.
My various favorites (if you want to call them that considering how much hair was extracted from my scalp during the debugging process) are as follows:
-- A bug in some Unix implementation with what happened to output I/O buffers when some SIGPIPE is processed. The fix was to force the signal handler to flush it.
-- A compiler bug in the IBM C compiler for AIX that generated incorrect code for a certain math expression. The fix was to break it up into several pieces creating intermediate results.
-- A bug in the Java 1.2 String implementation that caused major headaches. The substring() methods apparently shifted indexes around internally and returned the same instance but toString() didn't take that into account.
The bug I enjoyed exploiting the most when I was a lowly student was on the university mainframe. If you wanted more memory than your account was allowed to consume, you simply forced your program to trigger an out-of-memory exception. The OS would then allocate more in order to handle the exception. Repeat as necessary. Eventually, you had enough and the error handler ran your business logic.
You can unsuscribe from events. Ethanol-fueled -= SoylentHomosexualityDickReceived;
Started from the bottom now I'm here - Drake
Some of the worst bugs I have had to chase down have been in OpenGL GLSL shader programs. They don't have any kind of workable debugger available for them (there is one on Windows that requires a subscription to NVIDIA's developer program, and only one or two that run on Linux, neither of which I have been able to get to run reliably). There is no easy way to generate print statements - your only options are to output to a buffer and then read the buffer values on the CPU side, or write out colour values to pixels and then look at the colour of the pixels.
On top of this the shaders have a parallel execution structure where a vertex shader executes many times but at a 1:N ratio to the pixel shader. The most difficult to track down bugs I have had in GLSL programs have pertained to floating point values changing slightly between the vertex and pixel shader programs. Depending upon how the shader code is written, this in turn can cause the logic of the program to execute differently between the vertex and pixel shader.
Moral of the story being never, ever, ever trust the accuracy of floating point values!
That's very interesting. How did you know there were bugs there in the first place? Do you see artifacts at a high level?
Yes, essentially the pixel shading of the final geometry wasn't matching up with the geometry that was being produced (there was actually a geometry shader in the mix too iirc). It actually looked like multiple triangles were being rendered at the same position - in essence it looked exactly like z-fighting between two polygons.
I spent ages thinking the geometry shader was somehow producing two polygons when it should have been only producing one. But in actuality it was just because the pixel shader was shading differently for each pixel based on a floating point value that was fluctuating above/below a particular value. The fault was entirely mine, but it took quite a long time to figure out exactly why.
Moral of the story being never, ever, ever trust the accuracy of floating point values!
This reminds me of when I used to play with VRML back in the 90s. I had a torus with a seam that wouldn't go away. Fix? Assert that the last point is *exactly* equal to the first point. If you don't, floating point can generate a difference between sin(0) and sin(2*pi). That tiny difference was enough to make the engine not close the shape properly.
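The same kind of tiny mismatch shows up without any shaders or trigonometry. A minimal sketch, assuming ordinary IEEE 754 double arithmetic; the helper names are invented for illustration and have nothing to do with GLSL or VRML:

```c
/* Add `step` to zero `n` times, the way a loop walking around a
   closed shape accumulates its parameter. */
double accumulate(double step, int n)
{
    double total = 0.0;
    int i;
    for (i = 0; i < n; i++)
        total += step;
    return total;
}

/* Compare with an explicit tolerance instead of relying on ==. */
int nearly_equal(double a, double b, double tolerance)
{
    double diff = a - b;
    if (diff < 0.0)
        diff = -diff;
    return diff <= tolerance;
}
```

On typical hardware accumulate(0.1, 10) is not exactly 1.0, so a test like `== 1.0` or `>= 1.0` can go either way depending on how the value was computed. Either compare with a tolerance, or do what the torus fix did and copy the first point into the last one instead of recomputing it.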
I'm reminded of this entertaining debugging story [msdn.com] from 2004. One of the comments describes the OP as a "seriously hard-core debugging ninja here in [Microsoft] PSS"
An old J-Code Stepper Unit (bonus points if you have any idea what it is).
It was cycling and ghosting so bad we had to drop it off the code line after every time we used it. The local guys dicked around for months trying to fix it with no luck. It was configured by jumper wires on the back. The local boys checked the jumpers against the drawings a dozen times, and always had the same answer - all the jumpers are installed according to the drawings.
One day I stumbled across some very old plans that showed a slightly different jumper configuration, and then it hit me.
I called the local field service guy and got him to the site. He was not super excited about checking the jumpers again, so I really had to arm-twist to get him there.
The conversation kind of went like this - "How many jumpers are on the stepper?" "What?" "Count them. How many are there?" "Okay, fine, whatever, there are 17 jumpers." "How many jumpers are on the drawing?" "15 jumpers."
A short pause while this sank in, and then:
"Are you fucking kidding me!" A minute later the tech had found and pulled off the extra 2 jumpers and the unit ran perfectly from that moment on.
End of the 90s I was working for a company that made cell phone base stations. We had cards to do all the DSP. Each card could handle 8 calls, each chassis held enough cards so the chassis could handle 4 T1 lines. Needless to say, reliability was critical and resetting a chassis was Not A Good Thing (tm). We would have a DSP die every few hours. Randomly. No idea why or when. Only way to get it back was to reboot the chassis. This went on for 6 months, and as the deadline approached its visibility got higher.
Finally, my boss asked me to look into it. I knew nothing of DSP, worked for a different group, did not understand the code at all, but at least it was in C (compiled to some TI DSP chip). After about a week of reading the code and checking the manual to see what each library call did, I ran across an entry that said "Do not call this from an ISR". The code I was looking at was an ISR.
Rewrote the code, compiled, and shazaam! Problem solved. Got lots of brownie points across the whole project for that one :)
I worked with one of the older TI DSP kits - the first time I had to deal with an out-of-order execution processor for real-time workload. Not so good times, but at least the compiler generally seemed to work reasonably well.
When writing driver code, what you can and cannot do in interrupt routines is always critical. And not always clearly documented. And very hard to debug indeed.
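The usual way around a "do not call this from an ISR" restriction is to have the interrupt routine do nothing but set a flag, and do the real work later from normal context. A rough sketch in hosted C, using a signal handler as a stand-in for the ISR; the TI library call from the story isn't named, so everything here is hypothetical:

```c
#include <signal.h>

/* Written by the handler, read by the main loop. */
static volatile sig_atomic_t work_pending = 0;

/* Stand-in for the ISR: record that something happened and return. */
static void on_interrupt(int signo)
{
    (void)signo;
    work_pending = 1;
}

/* Install the handler; returns 0 on success, -1 on failure. */
int install_handler(void)
{
    return signal(SIGINT, on_interrupt) == SIG_ERR ? -1 : 0;
}

/* Called from the main loop. Returns 1 if there was deferred work. */
int service_pending_work(void)
{
    if (!work_pending)
        return 0;
    work_pending = 0;
    /* The call that is unsafe in interrupt context would go here. */
    return 1;
}
```

Calling raise(SIGINT) after install_handler() makes the next service_pending_work() return 1. On a real DSP the flag would be set from the actual interrupt vector, and the details depend on the vendor's runtime.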
Back in the early 80s we were selling an 8086 based data analyzer with a detachable keyboard. The stupid thing had a habit of locking up every once in a while for no apparent reason. One year we were at a trade show. My boss was doing a demo when the system hung. Unless you worked for the company you didn't know it was hung, but we all did. He said "And as you can see here on the back panel....", leaned over, and pushed his belly into the power button. Oops :)
Several months later they hired a consultant to look into the problem. He fixed it by changing the keyboard cable to one with a grounding sheath (think coax).
One more. In the early 90s we had a department-wide copier that was temperamental. The company sent a guy to fix it. He spent 1 day stripping it down to bare metal, and a second day putting it all back together. He ran a ream of paper through it, copying a test pattern. He then went into a conference room to finish up his paperwork.
I came in and showed him my copies, which were my copies overlaid with his test pattern. He went white and groaned. Then I couldn't help it and busted up laughing. Yep, I'd taken that ream of test patterns and put it back into the paper tray :)
I found a GPL violation in some code. The "developer" who violated it had changed *almost* all of the variable names. The few that he hadn't changed were odd, and rolled around in my head. This doesn't happen very often; but I actually had a dream that first led me to read a book with a similar title. Half way through the book I googled the exact text and found the offending code in an academic work that had just recently been crawled by Google.
I'm not sure if I should be proud or not of the way I handled it socially and administratively, but it was rooted out. In fact, we ended up auditing everything, which was a dull chore. No other problems found.
As far as real bugs go, the ones that are most challenging and always a relief to solve are the ones where one part of a C program steps on another. The actual point of failure is never where the bad code is, so you have to get real creative to find where somebody wrote into a structure "way over there" in another part of the code. I can't point to any one of those in particular though; just that general class of bugs always gives you a sense of relief and accomplishment.
AC to avoid dredging up hard feelings...
Due to its intermittent nature this one took a few days to get to grips with. One of my customers is a small/medium sized business. A few years ago their server would go off the air sometime after 11am, but possibly as late as close-of-business, almost daily. After eliminating server hardware and software and the network switch as the problem, the only thing left was the cable in the wall. I laid an untidy 20 m Cat5 cable between rooms and sure enough the problem seemed to go away. They brought in a cable guy who found the interesting truth - the drain hole of an air-conditioning unit was partially blocked - after a few hours, and depending on humidity, the aircon would start dripping into the wall space directly onto the network plug.
I have similar software experiences, but they don't really make good stories. Probably my favourite story isn't a bug and barely qualifies as a hack, and it's quite old. I was dealing with hardware that didn't allow network booting/provisioning because of no boot ROM. I got around this by adding a small boot partition containing a network-enabled Grub (with a software boot ROM) which pulled its configuration from a network location. I usually had the machines booting locally, but I could also tell individual machines to image themselves... I also had the machines default to booting locally if the network failed. This was NOT the way Grub network support was intended to be used... it was basically to allow a network-capable Grub boot menu which had already been bootstrapped via PXE. I guess I like it because it saved me a LOT of work. BTW, is there any hardware these days that doesn't come with network boot code? Probably Raspberry Pi at least, but they're ARM-based and Grub wouldn't run on them.
I got called over because there was an Excel spreadsheet from a client that wouldn't open. I come over and immediately notice that the preview pane in our inhouse file browser shows that the first two bytes of the file are PK. So. I suggest changing the extension from .xls to .zip and seeing if it opens. It does open, but there are just XML files inside. Then I put 2 and 2 together and suggest renaming it to .xlsx.
We were on Excel 2003 at the time and even though the 2007 converter was installed, apparently Excel 2003 doesn't pass files with the wrong extension to the converter. Once it was renamed to .xlsx it worked.
All told, not the most difficult problem in the world, but I looked like a genius for putting all of this together immediately.
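The check the file browser made obvious is easy to do in code as well. A minimal sketch; the function name is made up, and it only looks at the two bytes mentioned above rather than validating a real ZIP header:

```c
#include <stddef.h>

/* Returns 1 if the buffer starts with the ASCII bytes 'P' 'K',
   which is how ZIP containers (including .xlsx) begin. */
int starts_with_pk(const unsigned char *data, size_t len)
{
    return len >= 2 && data[0] == 'P' && data[1] == 'K';
}
```

An old-style binary .xls starts with the bytes D0 CF 11 E0 instead, so this returns 0 for it. A PK prefix only says "some kind of ZIP", which is why the rename to .zip worked before the rename to .xlsx did.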
So, something like, on constructor, register (this) in some global registry, on destructor unregister (this) was going on the path of:
a. when created (inside a function), the instance would register a certain (this) value
b. the content of the ret value was memcopied from the stack and the (this) silently changed to the value of the left-hand side
c. the destructor of the ret value was never called
d. when the destructor of the left-hand of the assignment was called, the application coredumped.
Took me 4 days to discover the bug, mainly because I couldn't imagine a compiler being buggy. After getting the explanation, it took another 3 days to change the code from "work with values and on the stack" to "work with pointers and dynamic memory allocation".
I have an AMD A10-5700 running Linux which would intermittently hang hard (no kernel dump, had to pull plug to restart, etc). Most of the time I'd come back to find the computer hung but sometimes it would hang when I'd be present. I'd typically hear the fan spin to very high freq right before this would happen.
Swapped power supply, memory to no effect. After trying FreeBSD, OpenBSD and NetBSD (FreeBSD worked fine, the others also hung but e.g. during bootup), and digging into source I tried using 'conservative' cpufreq instead of 'ondemand'. Problem solved.
Details here [askubuntu.com].
This one stumped everyone for a couple of months... while it wasn't code, I think it fits in here in spirit.
We had a client with an old PC that had their POS system on it. I was the 2nd tech on the site. The first one noted that the client's mouse had died and had replaced it a few weeks earlier. I arrived to another dead mouse, which I replaced, talked with them about their old machine, and left with everything in order again. Once more, a few weeks passed and again I went onsite to find their mouse was dead. At this point I decided it must be the motherboard or some other hardware issue, so I replaced the system, reloaded the software, and went on my way. A few weeks later... I'll be gotdayam if the mouse hadn't died again. WTF. OK. Something just wasn't right here... so I broke out the multimeter and, testing voltage, found 120V from neutral to ground at the plug. I told the client to get an electrician out, left a new mouse and never had another call from them again.
Why should that have been a problem? Here in Europe, mains plugs are symmetrical. You cannot know in advance which pin will be live, and which will be neutral. If a device only works correctly with the plug in one direction, then that device is defective (and could cause safety problems if it is not properly grounded). The power supply must, under all circumstances, provide sufficient isolation between ground and live, but also between ground and neutral. How is this different in the USA?
Here in the US, hot, ground, and neutral are always discrete and separate, and in modern wiring they are also specifically positioned in the wall plug. While I am not sure exactly what was wrong, I suspect that the hot wire was intermittently shorting to neutral, like one strand of wire occasionally touching, which I managed to catch.
The company I was at had a customer phone prompt system based on account numbers. When one particular customer entered his code, the system would end the call. A colleague who I greatly looked up to was stumped, and asked me to take a look. Just troubleshooting out loud, I said "Could the ampersand in the customer name be causing this?" She changed the ampersand to 'and' and that fixed it.
I was green and fresh from college. Although I was a programmer, one of my first jobs was to support a reporting app we had. One day, we ran out of space breaking not only our app but a bunch of others also running on the server. I spoke to the sys admin and he gave us more space. Problem solved. Never having actually looked at how much space we were taking up, I reported what we found up the chain of command and moved on. Then it happened again. And again. The sys admin accused us of storing too much data. I finally started keeping track of how much space we were using and how much free space was on the hard drive... manually. (I kept running the "dir" command on the command line and piping it to a file then placed the numbers in a spreadsheet to track it.) As we got close to not having enough disk space, I'd ask for more to prevent the apps from crashing.
I noticed a couple of things. Our app would suck up a tremendous amount of the free space and then release it. It made tracking how much we were using very difficult. Strangely, it stayed fairly consistent, but the amount of free space on the network drive kept getting smaller and smaller. I came to the conclusion that some other program that I couldn't see was eating up free space. I began to accuse the sys admin of not being able to spot the problem. My boss eventually had to step in.
Long story short: Because of how critical it was becoming, others jumped on board to find the root cause of the problem. Someone found a cache store belonging to our program that I was not given rights to see (for whatever reason) and that the sys admin didn't know about either. Once we set up a schedule to flush the cache, the problem was solved. I realized on that day that because of supposed security measures, neither the sys admin nor I could do our jobs and we wound up blaming each other for it. We'd both been victims of being silo'd.
In 2001 I spent 8 hours chasing my tail in a Perl program. Turned out I had forgotten to turn on "use strict" and it was just a variable typo that wasn't being caught. I didn't think of that; after all, that's an error the interpreter catches...
I once spent several days to hunt down a memory corruption bug. And of course, when running in the debugger, the bug didn't appear, so I had to narrow down the bug's location by trying. Finally, I found the bug in a nested if condition in a totally unrelated module that looked something like this:
else if ((ptr = malloc(sizeof(struct x))) === NULL)
The bug was in the compiler (Sun's C compiler) which made sizeof(struct x) return 4 at that point, no matter the true size of the struct, with our default compiler flags. Turning on the debugger flag made it yield the correct value.
else if ((ptr = malloc(sizeof(struct x))) === NULL)
The compiler didn't complain about the "==="?
Weird, isn't it?
Always feels good to dive into a problem, dig through some layers of misdirection, and find a quick and easy fix. About six months ago, we added a new suite of tests to the application I work on at work, using a relatively untested testing framework. For some reason, after about a month, the step to parse the output from the test suite was taking so long (on the order of an hour) that the build machine would usually just fall over and spit back a warning. It was an awful pain -- every time the continuous build ran, you had to go look at the actual output from the tests yourself to make sure you hadn't broken anything -- and it wasted resources.
The fix, in the end, was one character -- I added a non-greedy-match token to the regex that parsed the test output. Turns out, a bunch of the test suites had the same name, and the parser didn't expect that to happen. With the greedy matching, the parser was matching the beginning of each suite with the end of all the others that followed. That, of course, didn't scale well as we added more test suites!
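To make the greedy vs. non-greedy difference concrete: POSIX regex in C has no lazy quantifier, so this sketch mimics the two behaviours by hand with strstr. The `<suite>` markers are purely hypothetical, since the real test output format isn't given.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical markers; the real test-output format is not given. */
#define SUITE_OPEN  "<suite>"
#define SUITE_CLOSE "</suite>"

/* Greedy: from the first opening marker to the LAST closing marker. */
size_t greedy_match_length(const char *text)
{
    const char *start = strstr(text, SUITE_OPEN);
    const char *close, *last = NULL;

    if (start == NULL)
        return 0;
    for (close = strstr(start, SUITE_CLOSE); close != NULL;
         close = strstr(close + 1, SUITE_CLOSE))
        last = close;
    return last ? (size_t)(last - start) + strlen(SUITE_CLOSE) : 0;
}

/* Non-greedy: from the first opening marker to the FIRST closing marker. */
size_t lazy_match_length(const char *text)
{
    const char *start = strstr(text, SUITE_OPEN);
    const char *close;

    if (start == NULL)
        return 0;
    close = strstr(start, SUITE_CLOSE);
    return close ? (size_t)(close - start) + strlen(SUITE_CLOSE) : 0;
}
```

On "<suite>a</suite><suite>b</suite>" the greedy version swallows both suites in one match while the lazy one stops after the first. With hundreds of identically named suites, every greedy match has to run to the end of the output, which is presumably where the hour went.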
Not being a "real" programmer, my personal favorite involves HTML/CSS. About 8-9 years ago I had coded a web page template in a basic three-column format. The three columns had unique IDs (IIRC: col1, col2, and col3). The page rendered correctly in all browsers... except Internet Explorer. (Heard that before, huh?) But I could not figure out what was wrong. The HTML and CSS validated. I wasn't tripping any IE bug that I knew of. After endless fiddling, for some reason I changed col2 from an ID to a class in the CSS and the HTML, and the damn thing rendered fine.
I swear, there were no errors or naming conflicts anywhere in the code. (I write very neat HTML and CSS, and was once complimented by a "real" programmer that my code looked like something from a textbook.) I Googled around and never found anyone who had ever reported a similar problem. To this day, I have no idea what the bug was or why that solution worked.
I had one problem I was looking into that when I tried stepping through with a debugger it worked fine as a couple of threads were interacting with each other. I added some debug statements that would write information to a file so I could trace back the interaction afterwards instead of real-time.
I then couldn't reproduce the problem. It seemed that running the debug code I had just added had slowed that thread down just enough to stop the problem, whatever it was.
Even though the problem had been "fixed", I still needed to find out what the actual cause was, so I removed the debug bits I had just added one by one until it started happening again, to find out where the problem was occurring.
From there I read through the code a couple of times until I found a potential interaction that could cause the issue. In order to confirm I had to set the breakpoint to pause the entire JVM instead of just the one thread. This confirmed the issue and I was able to actually fix it properly by making the relevant code more thread safe.
It was certainly interesting writing the unit tests for that one.
This confirmed the issue and I was able to actually fix it properly by making the relevant code more thread safe.
I'm sorry, but code is either thread safe or it's not thread safe.
Did you make it thread safe or did you just add some code that hides the problem for now?
I meant one thread was synchronised with another thread, but a third thread wasn't. That was what was causing the issue.
I've had a few gotchas that really took some digging to figure out. First was a customer who insisted that his dedicated server was being hacked. I kept having to restore /etc/, /bin/, and many other directories. They were *gone*. The third time it happened, he admitted that it might have been something he was trying to do with his server. He was running a shell script. And in it was the command "rm -rf /usr/home/foo /usr/home/bar / /usr/home/fubar/". Whoops.
Then there was an old lady who swore her computer was hacked, the mouse kept moving while she was typing. Other techs had scanned the machine several times and declared it clean. I asked "is this a laptop?" (mind you this was over the phone) and it was. I showed her how to disable the touch pad that she kept bumping.
Another one also involved an older person and a mouse. There was something definitely wrong with their computer. Programs wouldn't open, nothing acted right. It worked fine for me until I realized that I was having to POUND on the mouse to get it to click. I convinced him that a new mouse was needed, and that was that.
There are others too, long and drawn out ones that I just don't remember.
But there is one that I most definitely do. I was doing onsite computer repair for an outfit in Reno. A rather difficult customer had us out to do some work, and in the midst of it, she wanted her DSL modem and router moved from below the desk to above it (or vice-versa). The modem lit up, the router lit up, she was able to get online.
The next day she called. Her Internet was not working. Since I had just been out there, I suggested she call the local telco. She did, and they insisted everything was fine on their end, and I believed them. I went out there a total of 3 times. The third time, I was on the phone with the telco to tell them how dumb they were, when something struck me. I unplugged the power from the modem and router, swapped them, and plugged them back in.
They were both 12V with the same polarity and the same connector. They had *very* different current ratings. One was almost 2A and the other only 500mA. She was online with no further troubles. I had swapped them when I moved them!
Had a user that complained that a certain application wouldn't let him copy & paste. I walked to his desk, and did the "show me" routine. Sure enough, copy & paste didn't work.
I installed said software on a test PC, and it worked just fine. Therefore the problem is with his PC, right? I had him demonstrate again. Same issue. I took his seat, blindly checking settings. I tried - and copy & paste worked for me.
We went back & forth - if I was in the seat, it worked. I would stand up, he would take the keyboard, and 2 seconds later have the problem.
The computer only had the problem when he was at the keyboard. It hated just him.
Well, about 15 minutes of this back & forth, I noticed that he was doing something ever so slightly different. He was clicking about 2 pixels to the left of where I was. Lousy app had a selection dead-spot right where he was clicking.
I still prefer to think the machine hated him.
The car was fine after starting. To get it started took luck and eventually a jump starter kit. Even then sometimes it just would not turn over. Not a battery problem. Not the engine. It was very difficult. There was no pattern. Just that some days the ignition failed.
One day I pulled the cable connecting the immobiliser to the battery off and cleaned the connection. Problem solved. The end of the cable had rusted and the immobiliser unit was sometimes not getting enough power to start.
A PHP codebase I inherited kept replacing non-ASCII characters in form input by random junk. Turns out there was a cast $output = (string) $input; somewhere, and casting a string to a string is clearly intended to garble up unicode, because PHP.
I work for a moderately large mining company in one of our many and varied engineering departments.
I had a co-worker typing out a long and detailed email to the IT department on some requirements we needed for a new server for a monitoring application. Midway through the email she begins to curse quite profanely at her terminal; it seems to have taken on a mind of its own. I pop over and yes, the terminal is doing all sorts of bizarre things: mistyped characters, random ASCII, popping in and out of menus. Just then another colleague pops by and adds his two cents' worth. We go through some basic troubleshooting and it seems Windows had managed to change the keymap somehow.
Out of curiosity we try to replicate the change and in the process someone tries ALT-S and in a flash, the email that had been open and the recipient of all the keyboard mishaps promptly fires off to all intended recipients.
Now, I don't know if it was the email full of gibberish or IT just being soft, but they bent over backwards to get that server online and with the specs we needed pronto.
Whilst we saw the funny side, the keyboard didn't survive the event.
This 'bug' lasted YEARS... way back (maybe 10+ yrs ago) came across a bug which appeared to be due to returning a bool from a C function. Really weird. Eventually replaced all 'bool' with 'BOOL' (like MS Windows). Years pass. Still occasional crashes/corruption in the odd little project - ah! the old 'bool' vs. BOOL bug!
No. One day found a #pragma pack in a long-forgotten header file, some code compiled with it, some without. Different structure packing. I still feel ashamed.
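For anyone who hasn't been bitten by this: a rough illustration of why mixed packing corrupts data, done here in Python rather than C. The `struct` module's `@` prefix uses native alignment (roughly a C struct with default packing) and `=` uses no padding (similar in spirit to `#pragma pack(1)`). The field layout (one char, one int) and the function names are made up for the example; this is a sketch of the general failure mode, not the poster's actual code.

```python
import struct

# Hypothetical record: one char followed by one int.
NATIVE = "@ci"  # native alignment, padding inserted before the int
PACKED = "=ci"  # no padding, comparable to a struct under #pragma pack(1)


def layout_sizes():
    """Return (native_size, packed_size) in bytes for the same two fields."""
    return struct.calcsize(NATIVE), struct.calcsize(PACKED)


def misread(value):
    """Write the record with one layout and read it back with the other."""
    raw = struct.pack(NATIVE, b"A", value)
    packed_size = struct.calcsize(PACKED)
    # The reader believes the int starts right after the char, so it picks
    # up padding bytes instead of the real value.
    _, seen = struct.unpack(PACKED, raw[:packed_size])
    return seen


if __name__ == "__main__":
    print("sizes (native, packed):", layout_sizes())
    print("wrote 1, reader saw:", misread(1))
```

Two translation units that disagree about the layout behave like the writer and reader here: same field names, different offsets, and nothing warns you.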
It was the night of January 25, 2003. I was working at a webhoster. We were migrating to a new datacenter. Meaning we prepped the new datacenter, switched off all the servers in the old location, moved them, and switched them on again.
Hours and hours of racking, stacking, pulling cables, testing cables. All through the night. From 23:00 till 06:00 or so. We were beat, but we were done. We started flipping on switches and routers. All looked good. We started flipping on servers. Round about the time we switched on the last 10 or so servers, all the switches and routers lit up like a Christmas tree. Blinking lights started to flicker furiously. We thought there was something wrong with the last couple of servers, so we switched those off. The problem persisted. Restarted the switches. That didn't solve anything. We started to switch off servers rack by rack. By this time, customers were starting to wake up and call as well, since we had drifted outside the maintenance window. Everything went crazy.
After some more trial and error, we noticed that if we turned on Windows servers, the problem would return. Right about that time, our upstream network provider called. They had noticed issues on IP addresses that were running SQL Server, and they were blocking that traffic from that point on.
We had migrated a datacenter on the exact date SQL Slammer became active. Shittiest timing. Ever.
Two different network gone apeshit stories for me:
I remember those PCMCIA card network adapters fondly. There were a number of different brands, and they all used a similar card to RJ45 dongle. They were even kind enough to share the same physical plug. Except, there didn't seem to be any standardisation between brands on what pinouts to use for the plug.
In a lot of cases, using the plug from one card with a different card would lock up the laptop hard. I had a drawer of cards and dongles and had to work out which dongle went with which card. That was a fun day...
One of my early instructors told me this one, it happened in the 60s. They had a mainframe that suddenly started to reboot randomly for no reason. They had the company's techs out several times, ran all sorts of diagnostics, the usual.
The problem was traced to an aging floor beam. Seems the power conduit ran under a toilet, whenever someone sat on that toilet there was enough give that the power to the mainframe dropped enough to cause a reboot.
I had a server that was plugged into a dedicated circuit. About 3 or 4 times a day the server would reboot and we couldn't figure out why. Someone had switched the server to an outlet that shared a circuit with the ice machine in the break room. Every time the ice maker made ice the server would lose power and then reboot.
From: Trey Harris
Here's a problem that *sounded* impossible... I almost regret posting the story to a wide audience, because it makes a great tale over drinks at a conference. :-) The story is slightly altered in order to protect the guilty, elide over irrelevant and boring details, and generally make the whole thing more entertaining.
I was working in a job running the campus email system some years ago when I got a call from the chairman of the statistics department.
"We're having a problem sending email out of the department."
"What's the problem?" I asked.
"We can't send mail more than 500 miles," the chairman explained.
I choked on my latte. "Come again?"
"We can't send mail farther than 500 miles from here," he repeated. "A little bit more, actually. Call it 520 miles. But no farther."
"Um... Email really doesn't work that way, generally," I said, trying to keep panic out of my voice. One doesn't display panic when speaking to a department chairman, even of a relatively impoverished department like statistics. "What makes you think you can't send mail more than 500 miles?"
"It's not what I *think*," the chairman replied testily. "You see, when we first noticed this happening, a few days ago--"
"You waited a few DAYS?" I interrupted, a tremor tinging my voice. "And you couldn't send email this whole time?"
"We could send email. Just not more than--"
"--500 miles, yes," I finished for him, "I got that. But why didn't you call earlier?"
"Well, we hadn't collected enough data to be sure of what was going on until just now." Right. This is the chairman of *statistics*. "Anyway, I asked one of the geostatisticians to look into it--"
"--yes, and she's produced a map showing the radius within which we can send email to be slightly more than 500 miles. There are a number of destinations within that radius that we can't reach, either, or reach sporadically, but we can never email farther than this radius."
"I see," I said, and put my head in my hands. "When did this start? A few days ago, you said, but did anything change in your systems at that time?"
"Well, the consultant came in and patched our server and rebooted it. But I called him, and he said he didn't touch the mail system."
"Okay, let me take a look, and I'll call you back," I said, scarcely believing that I was playing along. It wasn't April Fool's Day. I tried to remember if someone owed me a practical joke.
I logged into their department's server, and sent a few test mails. This was in the Research Triangle of North Carolina, and a test mail to my own account was delivered without a hitch. Ditto for one sent to Richmond, and Atlanta, and Washington. Another to Princeton (400 miles) worked.
But then I tried to send an email to Memphis (600 miles). It failed. Boston, failed. Detroit, failed. I got out my address book and started trying to narrow this down. New York (420 miles) worked, but Providence (580 miles) failed.
I was beginning to wonder if I had lost my sanity. I tried emailing a friend who lived in North Carolina, but whose ISP was in Seattle. Thankfully, it failed. If the problem had had to do with the geography of the human recipient and not his mail server, I think I would have broken down in tears.
Having established that -- unbelievably -- the problem as reported was true, and repeatable, I took a look at the sendmail.cf file. It looked fairly normal. In fact, it looked familiar.
I diffed it against the sendmail.cf in my home directory. It hadn't been altered -- it was a sendmail.cf I had written. And I was fairly certain I hadn't enabled the "FAIL_MAIL_OVER_500_MILES" option. At a loss, I telnetted into the SMTP port. The server happily responded with a SunOS sendmail banner.
Wait a minute... a SunOS sendmail banner? At the time, Sun was still shipping Sendmail 5 with its operating system, even though Sendmail 8 was fairly mature. Being a good system administrator, I had standardized on Sendmail 8. And also being a good system administrator, I had written a sendmail.cf that used the nice long self-documenting option and variable names available in Sendmail 8 rather than the cryptic punctuation-mark codes that had been used in Sendmail 5.
The pieces fell into place, all at once, and I again choked on the dregs of my now-cold latte. When the consultant had "patched the server," he had apparently upgraded the version of SunOS, and in so doing *downgraded* Sendmail. The upgrade helpfully left the sendmail.cf alone, even though it was now the wrong version.
It so happens that Sendmail 5 -- at least, the version that Sun shipped, which had some tweaks -- could deal with the Sendmail 8 sendmail.cf, as most of the rules had at that point remained unaltered. But the new long configuration options -- those it saw as junk, and skipped. And the sendmail binary had no defaults compiled in for most of these, so, finding no suitable settings in the sendmail.cf file, they were set to zero.
One of the settings that was set to zero was the timeout to connect to the remote SMTP server. Some experimentation established that on this particular machine with its typical load, a zero timeout would abort a connect call in slightly over three milliseconds.
An odd feature of our campus network at the time was that it was 100% switched. An outgoing packet wouldn't incur a router delay until hitting the POP and reaching a router on the far side. So time to connect to a lightly-loaded remote host on a nearby network would actually largely be governed by the speed of light distance to the destination rather than by incidental router delays.
Feeling slightly giddy, I typed into my shell:
$ units
1311 units, 63 prefixes
You have: 3 millilightseconds
You want: miles
"500 miles, or a little bit more."
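The arithmetic behind that punchline is easy to check by hand. A minimal sketch, assuming the one-way distance light covers in vacuum in 3 milliseconds (the same conversion the units query asks for); the constant and function names are just for illustration:

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre
METERS_PER_MILE = 1609.344            # international mile


def light_travel_miles(seconds):
    """Distance in miles that light covers in vacuum in the given time."""
    return seconds * SPEED_OF_LIGHT_M_PER_S / METERS_PER_MILE


if __name__ == "__main__":
    # 3 milliseconds, the connect timeout measured in the story
    print(f"{light_travel_miles(0.003):.1f} miles")
```

That comes out at roughly 559 miles, which fits the chairman's "500 miles, or a little bit more." Real signals in fibre or copper travel slower than this and a connect needs a round trip, so treat it as the story's back-of-the-envelope figure, not a network model.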
Trey Harris
--
I'm looking for work. If you need a SAGE Level IV with 10 years Perl, tool development, training, and architecture experience, please email me at firstname.lastname@example.org. I'm willing to relocate for the right opportunity.
In the 90s, I was responsible for getting a roughly 250 kLOC MacApp 3 software for medical device control stable. A few crashers were quite hard to nail down. The one I'm most proud of, I guess, was a bug in Script Manager. Under certain circumstances, we saw seemingly random memory get overwritten. I narrowed it to a specific call and drilled down into the OS assembly with MacsBug. Turned out Script Manager wrote to an address in a register it did not touch before. I found a workaround and reported it. Got me a very nice reply from DTS for the "hard work in MacsBug" :)
Another hard one where I can only claim an "assist" was with SetCursor(), which was "guaranteed" interrupt-safe, and MacApp switched on the watch cursor in an interrupt, given that guarantee. However, the Control Strip patched SetCursor and went on to do interrupt-unsafe memory handling. This led to very rare crashes. A bit of back and forth with DTS led to nothing. Then I noted that some addresses pointed to a pattern that looked suspiciously like a bitmap, and after drawing it out by hand, I figured out that it was one of the mouse cursors. With that info, I reported back again, and after a few days DTS responded that they had found the offender.
Honourable mention goes to a Linux Kernel (2.4.2x) issue where jffs2 would ignore a readonly-mount flag, which we noticed when checksums for a supposed-to-be-read-only boot file system changed. Tracking the bug down wasn't that hard, and the maintainer (David Woodhouse) went for a slightly bigger scope solution than my mailed-in patch, but I got credits in the change log for the Kernel :)
And right now, I've got my teeth bitten into what seems like a cache coherency / TLB integrity issue with a softcore CPU (Microblaze). It seems to read a wrong value from a SYSV shared memory area about once every 5 minutes, and only in a very specific application software setup. If my analysis so far is correct (I can _most_ likely rule out a race) and I get to fix it, that will definitely make this list :)
So it's the mid 1990's. I work for a small consulting shop. These are the days before RDP - Laplink and Carbon Copy were installed only where we could justify a dedicated analog phone line.
Anyway, one day the receptionist at a large client called us, complaining about a strange beeping noise. It only happened when she was at her desk. I drive over, and check things out. Her PC was just fine. Nothing unusual.
Then she says, "here, let me make some space for you to work..." - and she lays her coat on top of the keyboard of the other computer at her desk...
In the 1990s, I worked for a vendor that made servers, large and small, and quite a few incidents were big wins ...
1. A client complained that their UNIX V.3 machine stopped accepting any kind of data entry, and raised hell to the manager. So I was called to go and investigate on site. They restored the backup, and started entering the invoices into the system (was an RM/COBOL application). When the system stopped, I found that a data file was exactly 32MB (or some round number like that). When I looked at their shell session, ulimit was set to that exact number, to prevent runaway processes from eating up disk space. I changed the limit in their shell .profile or somesuch, and it worked from the first try. The client was so thankful that he sent a glowing thank you letter to the world headquarters of the company.
2. Another client (a bank) who just converted from old UNIX System V.3 to UNIX SVR4 complained that CPU utilization spikes mid-day around 11 am when customers rush in for transactions. I used truss [idevelopment.info] (the SVR4 equivalent of strace [wikipedia.org]) to monitor their application (also written in RM/COBOL). I found that it returned busy because a file was locked. Their application, instead of returning an error to the user, went into a loop retrying the operation, only to find it locked and retry again, eating up CPU time. I informed the developers of the issue, and they changed the code to sleep for a second before retrying, and the problem went away.
3. A client had a Decision Support System (DSS) written in Visual Basic, querying their data warehouse server that ran Teradata [wikipedia.org], which is a massively parallel database just for DSS apps. The VB app allowed them to do ad-hoc queries for a very large data set. When the app was launched, they found out that a crucial query was taking hours and not responding back. When I investigated, it turned out that Microsoft's Jet Engine database layer wanted to retrieve the entire tables into the Windows PC's memory, then do the database JOIN locally. Problem is: the table was too big, and the connection was on a 64kbps leased line! The solution I devised was to ditch Microsoft's Jet Engine, and go directly to the Teradata ODBC layer. That way, the JOINs were done inside the database as they should be, and the query took minutes to execute.
4. There was another bug that I do not remember, but I had to troubleshoot it remotely over the phone with someone who did not speak English. He read back the English output by describing the shape of the letters, and I instructed him to type the Arabic keys that carried the letters I wanted typed (e.g. ls -l). It worked, to my surprise, and I was lucky that we both had keyboards with the same Arabic layouts, something rare in those days.
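The fix from item 2 (pause before retrying a locked file instead of spinning) is a pattern worth spelling out. A minimal sketch in Python, not the original RM/COBOL; `retry_with_pause`, `operation` and `is_busy` are hypothetical names, and the attempt cap is my own addition so the loop cannot run forever:

```python
import time


def retry_with_pause(operation, is_busy, pause_seconds=1.0,
                     max_attempts=10, sleep=time.sleep):
    """Call operation() until it stops reporting 'busy'.

    Sleeps between attempts so a locked resource does not turn into a
    CPU-burning spin loop. Returns (result, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        result = operation()
        if not is_busy(result):
            return result, attempt
        if attempt < max_attempts:
            sleep(pause_seconds)  # yield the CPU instead of retrying at once
    raise TimeoutError(f"still busy after {max_attempts} attempts")


if __name__ == "__main__":
    # Simulated lock: busy twice, then free.
    answers = iter(["busy", "busy", "ok"])
    print(retry_with_pause(lambda: next(answers), lambda r: r == "busy",
                           pause_seconds=0.1))
```

Without the sleep, every waiting process hammers the lock check as fast as it can, which is what a late-morning rush of customers would amplify.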
First one in the mid '80s. We were using a cluster of 4 PCs connected by LANtastic (yes, LANtastic) to sort large (for the time) database indexes. The system would split the unsorted data into chunk files small enough to be held in memory, quick sort each chunk, and then perform a distributed merge using mailbox files to coordinate the sort.
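For readers who haven't met this scheme, the chunk-then-merge idea looks roughly like the sketch below. In-memory lists stand in for the chunk files, a single process stands in for the four PCs, and Python's built-in sort replaces the quick sort; the mailbox-file coordination is not modelled at all. `chunked_merge_sort` is a made-up name.

```python
import heapq


def chunked_merge_sort(items, chunk_size):
    """Sort items by sorting fixed-size chunks, then k-way merging them."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # Step 1: split into chunks small enough to "fit in memory" and sort each.
    chunks = [sorted(items[i:i + chunk_size])
              for i in range(0, len(items), chunk_size)]
    # Step 2: merge the sorted chunks, always taking the smallest head element.
    return list(heapq.merge(*chunks))


if __name__ == "__main__":
    print(chunked_merge_sort([5, 3, 8, 1, 9, 2], chunk_size=2))
```

Every step is deterministic, which is what makes the failure in the story so maddening.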
One day, an index is out of order post sorting. Two entries are transposed. The good news is that everything is deterministic and we still had the input, so it should be easy enough to re-produce. However, after watching those two entries go through the whole process, it comes out in perfect order. Running the input again without debugging produces a perfectly ordered index as well. Many re-runs with that dataset and others, including pathological inputs all come out fine.
Conclusion? Random bit flip in a CPU flag. The bug never happened again.
Next up, debugging LinuxBIOS (now Coreboot) on a new MB. Serial port isn't coming up and post codes aren't making it to the PCI bus. After a bit of testing, I find that I can toggle the power light by frobbing a couple bits I can reach. Devise a blink code to get minimal debugging info until I get the serial port up.
A while ago, on a school team project, a couple of us were working on some hc11 assembly. My teammate was tearing his hair out over some bug, exclaiming something like "OK! Load Immediate two-zero! Why doesn't this work?!" I leaned over and squinted at his xterm with tiny fonts, and said "You forgot the 'immediate' symbol." (FYI this was one of those micro families where the official syntax for immediate mode was given like 'LDAA #$op' rather than a separate mnemonic like 'LDAAI $op'.) He peered closer at the line, and said, "Wow! You have good eyesight!" - to which I said, "Not really - there was just an insufficient number of blurry symbols before the 20."
#1 final level of phone support for a large ISP in the '90s - customer had dialup internet dropping and unable to reconnect twice a day at times which changed slightly each day. problem? his phone exchange was being submerged at high tide.
#2 customer calls in with tiny colourful spots on his monitor, drivers fine, hardware seemed fine, eventually asked customer to wipe the screen - success. problem? spittle