Thursday, March 19, 2015
It's Not Your Imagination, It's REALLY Broken
However, there are those days where things look to go terribly wrong, where the information radiator is bleeding red. Your tests have caught something spectacular, something devastating. What I find ironic about this particular situation is not the fact that we jump for joy and say "wow, we sure dodged a bullet there, we found something catastrophic!" Instead, what we usually say is "let's do a debug of this session, because there's no way that this could be right. Nothing fails that spectacularly!"
I used to think the same thing, until I witnessed that very thing happen. It was a typical release cycle for us, stories being worked on as normal, work on a new feature tested, deemed to work as we hoped it would, with a reasonable enough quality to say "good enough to play with others". We merged the changes to the branch, and then we ran the tests. The report showed a failed run. I opened the full report and couldn't believe what I was seeing. More than 50% of the spun up machines were registering red. Did I at first think "whoah, someone must have put in a catastrophically bad piece of code!" No, my first reaction was to say "ugh, what went wrong with our servers?!" This is the danger we face when things just "kind of work" on their own for a long time. We are so used to little hiccups that we know what to do with them. We are totally unprepared when we are faced with a massive failure. In this case, I went through to check all of the failed states of the machines to look for either a network failure or a system failure... only none was to be found. I looked at the test failure statements expecting them to be obvious configuration issues, but they weren't. I took individual tests and ran them in real time and watched the console to see what happened. The screens looked like what we'd expect to see, but we were still failing tests.
After an hour and a half of digging, I had to face a weird fact... someone committed a change that fundamentally broke the application. Whoa! In this Continuous integration world I live in now, the fact is, that's not something you see every day. We gathered together to review the output, and as we looked over the details, one of the programmers said "oh, wow, I know what I did!" He then explained that he had made a change to the way that we fetched cached elements and that change was having a ripple effect on multiple sub systems. In short, it was a real and genuine issue, and it was so big an error that we were willing to disbelieve it before we were able to accept that, yep, this problem was totally real.
As a tester, sometimes I find myself getting tied up in minutiae, and minutiae becomes the modus operandi. We react when we expect. When we get delivered to us a major sinkhole in a program, we are more likely to not trust what we are seeing, because we believe such a thing is just not possible any longer. I'm here to tell you that it does happen, it happens more than I want to believe, or admit, and I really shouldn't be as surprised that it happens.
If I can make any recommendation to a tester out there, if you are faced with a catastrophic failure, take a little time to see if you can understand what it is,and what causes it. Do your due diligence, of course. Make sure that you re not wasting other people's time, but also realize that, yes, even in our ever so interconnected and streamlined world, it is still possible to introduce a small change that has a monumentally big impact. It's more common than you might think, and very often, really, it's not your imagination. You've hit something big. Move accordingly.