martyb writes:
"Remember that one bug that had you tearing your hair out and banging your head against the wall for the longest time? And how it felt when you finally solved it? Here's a chance to share your greatest frustration and triumph with the community.
One that I vividly recall occurred back in the early '90s at a startup that was developing custom PBX hardware and software. There was the current development prototype rack and another rack for us in Quality Assurance (QA). Our shipping deadline for a major client was fast approaching, and the pressure level was high as development released the latest hardware and software for us to test. We soon discovered that our system would not boot up successfully. We were getting all kinds of errors, and different errors each time. Development's machine booted just fine, *every* time. We swapped out our hard disks, the power supply, the main processing board, the communications boards, and finally the entire backplane in which all of these were housed. The days passed, and the system still failed to boot, giving us different errors on each reboot.
What could it be? We were all stymied and frustrated as the deadline loomed before us. It was then that I noticed the power strips on each rack into which all the frames and power supplies were plugged. The power strip on the dev rack was 12-gauge (i.e., rated for 20 amps), but the one on the QA rack was only 14-gauge (15 amps). The power drawn while spinning up the drives was just enough to leave the system board under-powered during bootup.
We swapped in a new $10 power strip and it worked perfectly. And we made the deadline, too!
So, fellow Soylents, what have you got? Share your favorite tale of woe and success and finally bask in the glory you deserve."
(Score: 2) by lhsi on Saturday March 08 2014, @10:43AM
I was looking into a problem where a couple of threads were interacting with each other, and when I tried stepping through it with a debugger, it worked fine. So I added some debug statements that wrote information to a file, letting me trace back the interaction afterwards instead of in real time.
I then couldn't reproduce the problem. It seemed that running the debug code I had just added had slowed that thread down just enough to stop the problem, whatever it was.
Even though the problem had been "fixed", I still needed to find out what the actual cause was, so I removed the debug bits I had just added one by one until it started happening again, to narrow down where the problem was occurring.
From there I read through the code a couple of times until I found a potential interaction that could cause the issue. To confirm it, I had to set the breakpoint to suspend the entire JVM instead of just the one thread. This confirmed the issue, and I was able to fix it properly by making the relevant code more thread safe.
It was certainly interesting writing the unit tests for that one.
(Score: 1) by stderr on Saturday March 08 2014, @01:29PM
I'm sorry, but code is either thread safe or it's not thread safe.
Did you make it thread safe or did you just add some code that hides the problem for now?
alias sudo="echo make it yourself #" #
(Score: 2) by lhsi on Saturday March 08 2014, @03:07PM
I meant one thread was synchronised with another thread, but a third thread wasn't. That was what was causing the issue.
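lhsi didn't post the actual code, so what follows is only a hypothetical sketch of the kind of bug described: two threads update shared state under a common lock, while a third thread takes an unguarded path. The class and method names (`SharedCounter`, `incrementLocked`, `incrementUnlocked`, `runDemo`) are invented for illustration. Because the unguarded `count++` is a non-atomic read-modify-write, it can interleave with the locked updates and silently lose increments. Whether it actually does so on a given run depends on timing, which is the same sensitivity that let the original bug vanish under a debugger or extra logging.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: a counter shared by three threads.
public class SharedCounter {
    private final Object lock = new Object();
    private int count = 0;

    // Path used by the first two threads: guarded by the shared lock.
    void incrementLocked() {
        synchronized (lock) {
            count++;
        }
    }

    // Path used by the third thread in the buggy version: no lock, so its
    // read-modify-write can interleave with the other threads and lose updates.
    void incrementUnlocked() {
        count++;
    }

    int get() {
        synchronized (lock) {
            return count;
        }
    }

    // Starts three threads that each increment perThread times.
    // When thirdThreadLocks is false, the third thread bypasses the lock.
    static int runDemo(boolean thirdThreadLocks, int perThread) {
        SharedCounter c = new SharedCounter();
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < 3; t++) {
            final boolean locked = (t < 2) || thirdThreadLocks;
            threads.add(new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    if (locked) {
                        c.incrementLocked();
                    } else {
                        c.incrementUnlocked();
                    }
                }
            }));
        }
        for (Thread th : threads) {
            th.start();
        }
        for (Thread th : threads) {
            try {
                th.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return c.get();
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // The unlocked result may fall short of 3 * n; how often depends on timing.
        System.out.println("third thread unlocked: " + runDemo(false, n) + " (expected " + 3 * n + ")");
        System.out.println("all threads locked:    " + runDemo(true, n));
    }
}
```

The fix in this sketch is simply to route every writer through the same lock (or to replace the field with a `java.util.concurrent.atomic.AtomicInteger`). With only two of the three threads synchronised, the code is not thread safe. It only looks that way when the timing happens to cooperate.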