Thursday, November 4, 2021

How to Tame Bugs in Production (an #OnlineTestConf 2021 Live Blog)


One of the interesting aspects of being a firefighter is that firefighters spend a surprisingly small amount of time actually fighting fires. Frankly, that's a good thing. What they do instead is study methods for fighting fires, keep their equipment at the ready should they need it, and reach out into the community to help people avoid fires in the first place. Still, fires are to be expected, and firefighters deal with them. They know they can do a lot to prepare and a lot to reduce the risk of fires, but no matter how much training and mitigation they do, fires will occur, and the best option is to stay calm and in control and deal with them when they happen.

To this end, bugs that appear in production are akin to the occasional fire. The point is, they happen. We can't prevent every single bug from making it into production any more than any fire department can prevent every fire from happening. At some point, they just appear due to conditions being right, and then the bugs have to be dealt with. In short, the fire needs to be put out. 

Elena points out in her talk that bugs that appeared in her product (in this case, she was talking about developing and shipping the game "Candy Crush") often came from specific issues with third-party products that the app depended upon. Thus it was clear that there were going to be bugs no matter what happened. As such, it helps to realize that instead of panicking, it makes more sense to monitor the bug, see the impact, determine the best approach to fix it, and release an update in the least obtrusive way possible. In today's online-delivered and SaaS environments, this is possible. It's nothing like the days of game and software development where software went out on CDs or DVDs and was effectively treated as "eternal". Still, even in those days, we didn't just despair. Instead, we spent time looking at what caused the bug and how to avoid reaching that point, and then published that information.

The first and foremost consideration is the impact on users. Show-stoppers are bugs where the game or app crashes or literally prevents users from advancing or completing a workflow. Elena shared an example where the game she was releasing had an issue with the Chinese language display on the login page. The question was "do we release as is and patch, or do we fix it immediately?" This came down to a question of "how much of an issue would it be for an extensive group of users?" If a bug affects a small group of users, or it's fairly localized, or it's a situation that other context clues can help users get past (login pages tend to be pretty standard), then we can determine that it may not be critical enough to stop everything and fix. As long as the issue is communicated and there is a plan to fix it, that's fine for many users, even if the bug affects them.
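For those who like to see a decision like this written down, here is a minimal Python sketch of the triage question described above. To be clear, this is my own illustration and not something Elena showed in her talk: the `BugReport` fields, the `triage` function, and the 25% threshold are all hypothetical names and numbers chosen just to make the reasoning concrete.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    FIX_NOW = "fix immediately"
    PATCH_LATER = "release as is, communicate the issue, and patch"


@dataclass
class BugReport:
    # All fields are hypothetical; real triage data will look different.
    blocks_workflow: bool   # crash, or users cannot advance or complete a workflow
    affected_share: float   # estimated fraction of users affected, 0.0 to 1.0
    has_workaround: bool    # context clues or another path let users get past it


def triage(bug: BugReport, wide_impact_threshold: float = 0.25) -> Action:
    """Decide between an immediate fix and a later patch.

    The threshold is an assumed value for illustration only.
    """
    # Show-stoppers always get fixed right away.
    if bug.blocks_workflow:
        return Action.FIX_NOW
    # A widespread bug that users cannot work around is also urgent.
    if bug.affected_share >= wide_impact_threshold and not bug.has_workaround:
        return Action.FIX_NOW
    # Small, localized, or easily bypassed bugs can wait for a planned patch.
    return Action.PATCH_LATER


# Something like the login display issue: localized, and users can get past it.
login_display_bug = BugReport(blocks_workflow=False, affected_share=0.05, has_workaround=True)
print(triage(login_display_bug).value)
```

In practice the inputs would come from monitoring and support data rather than guesses, and the decision would still be a conversation, not a function call.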

In a game setting, a common way to mitigate a bug in production is to reward users who stick with you until the fix is applied. Something on the order of "Hey, thank you to everyone who was patient with us while we were working on this issue. As a thank you for your patience, we are awarding everyone 10,000 Gil" (yes, I'm a Final Fantasy fan, sue me ;) ). The point is, acknowledge the issue, be effective in how you respond to it, look at a legitimate timetable, and then plan your course of action. This also emphasizes the benefit of preparing for issues in advance. It is a great opportunity for testers especially to examine rapid response protocols or to look at some broader areas of testing that may not be as pertinent in the immediate moment.

In general, I think the firefighting approach makes the most sense, but I appreciate the fact that Elena approached the firefighting metaphor from a different angle. We usually focus on the immediate fire and the panic that goes with it. Seasoned firefighters don't panic because they have trained for exactly these scenarios. We as software developers and testers would do well to emulate that example :).
