Monday, July 6, 2009

But did you do the phosphorus test?

I heard the phone clang down and my colleague Steve mumble, distraught, "She's going to kill the fish." His wife had called to tell him about a phosphorus problem in their fish tank at home. She's a medical researcher, a biologist by training. Steve's first reaction when she told him there was a phosphorus problem was to ask if she had in fact done a phosphorus test. No, she said, but she'd run through all of the other chemical and algae tests, so of course it had to be the phosphorus, and thus she'd started adding more phosphorus to the tank -- they'd know in a few days if that was the problem. Steve, imagining coming home to a tank of dead fish, was not impressed that his scientist wife had failed to use the scientific method at home.

It's so often like that in technology as well. Despite years of rigorous training to use the scientific method to guide our actions (it is called "computer science" for a reason), it's easy to throw all that away when faced with a challenge. A customer came to me the other day asking about monitoring tools to help with a production triage situation for a failing web service. A developer assigned to the task interrupted us, saying that a fix had been deployed ten minutes prior and it looked like it was working. Let's reflect upon that:

a) No load or performance testing scripts existed for this web service.
b) No monitoring or profiling tools had been deployed with this service in either a pre-production or production setting.
c) A hopeful fix had been hot-deployed to production and left to run for a mere ten minutes before victory was declared.
d) No permanent monitoring was put in place to prevent the next occurrence of the problem.
e) Apart from a few manual executions of the service and a face-value assessment by one individual, no further validation to correlate the fix with the perceived problem occurred.
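
Point e) can be made concrete with a little arithmetic. What follows is only a minimal sketch, not anything the team in question used: it assumes you can count failed and total requests before and after the fix, and it applies a standard two-proportion z-test to ask whether the drop in failures is larger than chance would explain. The function name and all of the counts are hypothetical.

```python
import math


def two_proportion_z(fail_before, total_before, fail_after, total_after):
    """Two-sided z-test for equal failure rates before and after a change.

    Returns (z, p_value). A large p_value means the observed difference
    could easily be noise. The normal approximation is rough for very
    small samples, which is itself a reason to collect more data.
    """
    rate_before = fail_before / total_before
    rate_after = fail_after / total_after
    # Pooled failure rate under the assumption that the fix changed nothing.
    pooled = (fail_before + fail_after) / (total_before + total_after)
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / total_before + 1 / total_after)
    )
    if std_err == 0:
        # No failures at all (or all failures): nothing to distinguish.
        return 0.0, 1.0
    z = (rate_before - rate_after) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: a few manual executions versus a scripted load test.
manual_runs = two_proportion_z(1, 5, 0, 5)
load_test = two_proportion_z(200, 1000, 20, 1000)
print("manual runs:", manual_runs)
print("load test:  ", load_test)
```

With only five requests on each side, even a fall from one failure to zero cannot be told apart from luck, while the same comparison over a thousand scripted requests gives a clear answer. That is the gap between a face-value assessment and a test.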

Chances are good that Steve's fish will be fine, but can the same be said for those cases where we play roulette with mission-critical IT systems? Just as in the case of Steve's fish, there is no legitimate reason for a lack of objective, quantitative analysis except basic human apathy. Anyone who has ever taken a statistics course or been face-to-face with a serious production issue knows that ruling out many options with other tests does not make it safe to jump ahead on gut feeling. Why abandon a working method for one that brings doubt, risk, and exposure to criticism? Run the phosphorus test and let the results be your guide.

2 comments:

YaHuto N'Futo said...

My monstrous snail is dead. He was the old man of the tank. High phosphorus typically affects non-fish livestock such as snails, shrimp, anemone... My fat daddy crab has moved into the dead snail's shell. Did the high phosphorus do in the snail? Was it the remediation activity? Was it something unrelated yet mistakenly correlated? We'll never know; we didn't have a valid pre-incident comparative value. Krazy Krabby has the shell he needed. Life moves on.

Anonymous said...

Great follow-up, Steve -- see, proof positive that forming a hypothesis, testing accurately, and basing decisions on the data collected is the only way to go.