Tuesday, February 8, 2011

BOOK CLUB: How We Test Software at Microsoft (15/16)

This is the first part of Section 4 in “How We Test Software at Microsoft”. We are in the home stretch now, just one more chapter to go after this one! This section deals with the idea of solving future testing problems today where possible, both in the testing technique sphere with failure analysis and code review and in the technology sphere with virtualization. Note, as in previous chapter reviews, Red Text means that the section in question is verbatim (or almost verbatim) as to what is printed in the actual book.

Chapter 15: Solving Tomorrow’s Problems Today

This section starts out with Alan making the case that, while software testing is an expanding field, and one that is getting more respect over time (especially considering where it was a decade or two ago), it still suffers from the fact that the prevailing paradigm of software testing is one of reactive thinking.

Why do we hire testers? Because we proved that developers couldn’t find all of their own bugs, and that developers perhaps weren’t in the best position to be effective at that process anyway (just as I’m probably not the most effective person to debug and test my own test scripts or review my own test plans). For the state of the tester’s art to improve and flourish, part of the effort is going to require that we stop working exclusively in a reactive mode and work more toward finding proactive solutions to the situations we are facing. Microsoft is no stranger to many of these issues and questions, and in this chapter Alan describes some of the more forward-looking methods Microsoft is implementing to try to get a handle on the future of testing and the challenges it will present.

Automatic Failure Analysis

The scary thing when it comes to a company like Microsoft is that, with all of the applications, their respective platforms, flavors that run on the desktop, on the web, and in the cloud, mixed with language support, a platform or application could have tens of thousands of test cases, and quite possibly more. With a few failed cases, analysis is relatively easy to focus on the specific issues raised during the tests. When dealing with test points that may number in the hundreds of thousands or even millions, though, looking at even 1% of failures is a terrifying proposition: if you have a million total test cases and pass 99% of them, failing only 1%, where do you begin to whittle down 10,000 failed cases?

Automated testing allows thousands of test cases (perhaps even hundreds of thousands of cases) to run with very little human interaction. Test analysis, on the other hand, still requires a human touch and observation. When that needed human touch means looking at tens of thousands of failed cases, the desire for automated methods that provide at least some first-level test analysis is clear. Implementing such a system, of course, is another thing entirely.

Overcoming Analysis Paralysis

Too many failures requiring too many test cycles to determine what is happening can become old, daunting, and downright scary very quickly. How is a tester supposed to deal with all of them? As said above, part of the focus needs to be on creating automation that allows for first-order analysis and gives an idea of what is happening with the errors, but even more important is the ability to get to and focus on the errors before they become overwhelming. In this manner, efforts to examine and determine root-cause issues will go a long way toward making sure that a backlog of errors doesn’t develop, along with the high blood pressure of the testers trying to make sense of it all.

To complicate matters, if a test fails multiple times, there’s no guarantee that the test failed each time for the same reason(s). There could be different factors coming into play that will skew the results or give false conclusions. Investigating the failures and determining what is really causing the problems is a skill and a black art that will need to be evolved and practiced at a greater rate for all testers going forward, because systems are not getting any simpler!

The Match Game

When we run across an error in software we are testing, depending on the culture, the bug reports that are created tend to be very specific about the error, what happened to make it occur, and the environment details that were active when the error occurred. In many cases, their automated counterparts are nowhere near that detailed, often recording little more than the fact that the test failed. Since many automated tests look for match criteria to determine the PASS/FAIL of a test, getting log details is a mission-critical aspect of any automation testing scheme.

Microsoft has created a failure database, which, as its name implies, holds information about each known system failure. Each test run is compared against the information in the database, and if there is a match, a bug is auto-generated that references the known issue (sounds cool for system issues that are variations on a theme).
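The book doesn’t show the implementation, but the matching idea can be sketched in a few lines of Python. Everything here (the FailureDatabase class, the signature fields, the triage function) is my own hypothetical stand-in for illustration, not Microsoft’s actual system:

```python
# Hypothetical sketch of matching test failures against a known-failure
# database. Signatures here are just (test name, error text) pairs; a real
# system would match on richer data (call stacks, error codes, environment).

class FailureDatabase:
    """Stores known failure signatures mapped to the bug that tracks them."""

    def __init__(self):
        self.known = {}  # (test_name, error_text) -> bug id

    def add_known_failure(self, test_name, error_text, bug_id):
        self.known[(test_name, error_text)] = bug_id

    def match(self, test_name, error_text):
        """Return the existing bug id if this failure is a known issue."""
        return self.known.get((test_name, error_text))


def triage(db, test_name, error_text, next_bug_id):
    """First-level analysis: reuse a known bug, or auto-file a new one."""
    bug = db.match(test_name, error_text)
    if bug is not None:
        return ("duplicate", bug)       # annotate the run; no new bug filed
    db.add_known_failure(test_name, error_text, next_bug_id)
    return ("new", next_bug_id)         # auto-generated bug report
```

With something like this, the second, third, and hundredth occurrence of the same variation-on-a-theme failure gets folded into one bug instead of one hundred.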

Good Logging Practices

Writing to a log is a common occurrence for automated tests. However, with a little additional effort, that log file data can be a treasure trove of information about the health and well-being of a software product. To leverage the benefits of logging and help design tests that can stand the test of time, Alan recommends the following:

Logs should be terse on success and verbose on failure: In practice, “noisy” tests are often poorly written tests. Each piece of information recorded to the log should have some purpose in diagnosing an eventual test failure. When a test fails, the test should trace sufficient information to diagnose the cause of that failure.

When a test fails, trace the successful operation(s) prior to the observed failure: Knowing the state of the last good operation helps diagnose where the failure path began.

Logs should trace product information, not information about the test: It is still a good idea to embed trace statements in automated tests that can aid in debugging, but these statements do not belong in the test results log.

Trace sufficient and helpful failure context: Knowing more about how the failure occurred will assist in diagnosis of the underlying defect. Instead of logging:

Test Failed

Log something like:

Test Failed
Win32BoolAPI with arguments Arg1, Arg2, Arg3
returned 0, expected 1.

Or, better still:

Test Failed
Win32BoolAPI with arguments Arg1, Arg2, Arg3
returned 0 and set the last error to 0x57,
expected 1 and 0x0

Avoid logging unnecessary information: Log files do not need to list every single action executed in the test and underlying application. Remember the first rule
above and save the verbose logging for the failure scenarios.

Each test point should record a result when a result has been verified or validated: Tests that aggregate failures often mask defects. If a test is in a fail-and-continue mode, it is important to know where each failure occurred in order to diagnose which subsequent failures were dependent on, and which were independent of, the previous failure.

Follow team standards on naming: Standards can help ensure consistency in reading the log files. All object, test, and procedure names should make sense and be non-degenerate (one name for one thing).
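To make the first few guidelines concrete, here’s a minimal Python sketch of a per-test-point check helper that stays terse on success and gets verbose on failure, and records a result for every verification. The logger name and message layout are my own assumptions, not anything prescribed by the book:

```python
import logging

# Minimal sketch: terse on success, verbose on failure, one recorded
# result per verified test point.

log = logging.getLogger("testrun")


def check(name, actual, expected, context=""):
    """Record one verified test point; trace details only on failure."""
    if actual == expected:
        log.info("PASS: %s", name)  # terse on success
        return True
    # Verbose on failure: what was called, what came back, what was expected.
    log.error("FAIL: %s %s returned %r, expected %r",
              name, context, actual, expected)
    return False
```

In a fail-and-continue test, calling `check()` at every verification point gives you the per-point record the last guideline asks for, so dependent and independent failures can be told apart afterward.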

Machine Virtualization

One of the best decisions that Tracker Corp. (the company that I worked at from 2005 to 2011) made was to put the majority of our test-specific environments on a fairly beefy Windows Server 2008 machine, max out the RAM and disk space, and load it to the gills with virtual machines running under Hyper-V. If I, a lone tester, found this setup to be a tremendous blessing, I can only imagine how welcome this type of technology would be to the testing professionals at Microsoft, and I’d wager they consider it a blessing for the same reasons that I did.

Ten years ago, I worked for Connectix, the company that first developed Virtual PC, which is in many ways the precursor to Hyper-V. I found the Virtual PC model to be very helpful with setting up test environments, tearing them down, and cloning them for various tests, as well as setting up virtual networks that allowed me to simulate client/server transactions as though they were part of an isolated network. Hyper-V virtual machines allow much the same, and have added considerable enhancement as well.

Virtualization Benefits

The benefits of virtualization are many, not the least of which is the ability to create, store, run, archive, and shuffle an almost limitless number of testing environments. With Windows Server 2008 Datacenter, there is no license limit on the number of virtual machines that can run at any given time (your RAM, CPU, and disk space will certainly impose limits, but on a mid-grade server machine I frequently ran 10 to 15 virtual machines at a time). Being able to manage and run up to 15 simultaneous machines is a wonderful convenience. More to the point, with the machine located in a server room and all access handled via RDP, the footprint for these machines was tiny (a single server machine), a dream come true for any tester who has had to maintain multiple physical machines in his cube, office, or lab.

More to the point, it’s not just running all of these machines simultaneously; it’s also the ability to configure the machines to run on real and virtual networks, create domain controllers and private domains, Virtual Private Networks (VPNs), and a variety of services running on different machines (web, database, file service, etc.). As mentioned in the last chapter, services are often run on multiple machines as separate components. These components can also run in virtual machines and be tested on a single host server with as many guests as needed to facilitate the services. Configuring and updating the machines can be done both in real time and while the machines are offline.

Outside of the ability to create and deploy test machines rapidly by creating guest machines from a library of pre-configured disk images, the true beauty of Hyper-V is its extensive ability to snapshot virtual machine images, in many cases keeping several snapshots at a time. While only one snapshot could be active at any given time, I often had machines with several snapshots that allowed me to test iterative steps, and if any of the steps had a problem, I could restore back one, two, three steps or more, or all the way back to the beginning of the process. I don’t necessarily recommend this as a regular practice for all virtual machines (relying on too many snapshots greatly increases the odds of a failure in one of them, and when a snapshot is lost or corrupted, it’s just gone). Still, even with that risk, having a safeguard between steps and a quick way to go back to a known state saved countless hours of configuration and set-up time over the past few years. I can only imagine how huge the savings would be for an organization the size of Microsoft.

Test Scenarios Not Recommended with Virtualization

While virtualization is a fantastic technology and a lifesaver in many situations, it does have some drawbacks. Specifically, most virtual environments use virtual hardware, so applications that require real-time access to physical peripherals will not be able to get it through virtualization. Likewise, the video modes are optimized for use through virtual machines, so CAD and 3D gaming/simulation are poor fits for virtual machines; they will be grossly underpowered. Ultimately, virtual machines are bounded by the servers they run on: the guests together must share the total CPU, RAM, and disk space of the host server. If the host maxes out, the virtual machines likewise max out; there’s nowhere to go but down.
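That resource-bounding point reduces to simple arithmetic. Here’s a toy Python check of whether a set of guests fits on a host; the dictionary field names are my own assumptions, and real hypervisors complicate this by overcommitting CPU (and sometimes RAM), but the basic idea holds:

```python
# Toy capacity check: guest VMs share the host's finite resources.
# Real hypervisors overcommit CPU, so treat this as a conservative bound.

def fits_on_host(host, guests):
    """Return True if the guests' combined CPU/RAM/disk fit on the host."""
    for resource in ("cpus", "ram_gb", "disk_gb"):
        if sum(g[resource] for g in guests) > host[resource]:
            return False
    return True
```

For a mid-grade box like the one I described (say 16 cores, 64 GB RAM), fifteen small guests fit comfortably; double that and the CPU budget is blown.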

Code Reviews and Inspections

Alan makes the point that even the manuscript for each chapter of HWTSAM goes through multiple hands. The reason is that, no matter how many times he reviews his own work, someone else will see things in a different light or notice something he’s missed, simply because Alan knows the intent of what he wanted to write, and thus may totally skim over something that would be obvious to anyone else (I suffer from this myopia myself with just about every blog post that I write).

The code review process does the same thing for developers and the code that they write. In the same way that handing off a manuscript to others can help catch punctuation and grammatical issues, code reviews can catch bugs in code before it is even compiled.

Microsoft puts a lot of emphasis on code review and on having testers get involved in the process (note: Alan’s talk at the 2010 Pacific Northwest Software Quality Conference was specifically about software testers taking part in the code review process).

Types of Code Reviews

Code reviews can range anywhere from informal quick checks to in depth, very specific review sessions with a team of engineers. Both share many attributes, and while one is less rigorous than the other, they still aspire to the same thing, to use criteria to determine if code is accurate, solid and well constructed.

Formal Reviews

Fagan inspections (named after the inventor of the process, Michael Fagan) are the most formal code reviews performed. A group of people are selected to review the code, with very specific roles and processes that need to be adhered to. Formal meetings with roles assigned to each participant are hallmarks of the method.

Those participating are expected to have already read and pre-reviewed the code. As you might guess, these inspections take a lot of time and manpower to accomplish, but they are very effective when harnessed correctly. While the method is effective, its intensely time-consuming nature is actually part of the reason the technique is not widely used at Microsoft.

Informal Reviews

The big challenge with informal reviews is that, while they are indeed faster, they are not as comprehensive and may not hit many of the deeply entrenched bugs the way a more formal process would. Alan showcases a number of methods utilized, ranging from pair programming sessions with review in mind, to email round-robins where code is inspected and commented on, to over-the-shoulder checks. All of them have their pluses and minuses, and in most cases the trade-off for a thorough process is the time spent to actually do it.


Checklists have the ability to focus reviewers on areas that might be considered the most important, or otherwise areas that need to be covered so that a thorough job is done. An example checklist provided by Alan follows below:

  • Functionality Check (Correctness)
  • Testability
  • Check Errors and Handle Errors Correctly
  • Resources Management
  • Thread Safe (Sync, Reentry, Timing)
  • Simplicity/Maintainability
  • Security (INT Overflow, Buffer Overruns, Type Mismatches)
  • Run-Time Performance
  • Input Validation
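One simple way to put a checklist like this to work is to track which areas a review session has actually touched. This little Python sketch is my own construction, not anything from the book; it just flags the items a review hasn’t covered yet:

```python
# Track checklist coverage during an informal review. The items mirror
# Alan's example checklist; the tracking structure itself is assumed.

CHECKLIST = [
    "Functionality Check (Correctness)",
    "Testability",
    "Check Errors and Handle Errors Correctly",
    "Resources Management",
    "Thread Safe (Sync, Reentry, Timing)",
    "Simplicity/Maintainability",
    "Security (INT Overflow, Buffer Overruns, Type Mismatches)",
    "Run-Time Performance",
    "Input Validation",
]


def uncovered(reviewed_items):
    """Return the checklist areas this review has not yet touched."""
    done = set(reviewed_items)
    return [item for item in CHECKLIST if item not in done]
```

Even an over-the-shoulder check ends on a stronger note when the reviewer can see, at a glance, which of the nine areas never came up.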

Taking Action

In addition to the number of issues found by activity, it can be beneficial to know what kinds of issues code reviews are finding. Below are examples of rework categories found during various Microsoft code reviews, along with steps that can help catch these issues before code review.

  • Duplicate code: for example, re-implementing code that is available in a common library. Remedy: educate the development team on available libraries and their use; hold weekly discussions or presentations demonstrating the capabilities of those libraries.

  • Design issue: for example, the design of the implementation is suboptimal or does not solve the coding problem as efficiently as necessary.

  • Functional issue: for example, the implementation contains a bug or is missing part of the functionality (omission error).

  • Spelling errors: implement spell checking in the integrated development environment (IDE).

Time Is on My Side

While code reviews are important, it’s also important to consider the time impact of conducting them. Guessing how much time they take is usually wildly inaccurate, so using mechanisms that actually measure how long it takes to review code can prove very beneficial. The time it takes to review code is subjective, depending on the complexity of the systems being reviewed, but the following questions can help focus the time requirements:

  • What percentage of our time did we spend on code review versus code implementation?
  • How much time do we spend reviewing per thousand lines of code (KLOC)?
  • What percentage of our time was spent on code reviews this release versus the last release?
  • What is the ratio of time spent to errors discovered (issues per review hour)?
  • What is the ratio of issues per review hour to bugs found by test or customer?
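The questions above all reduce to simple ratios once the raw numbers are collected. Here’s a hypothetical Python sketch with made-up example figures (the function name and inputs are my own, purely for illustration):

```python
# Compute the review-time ratios from the questions above.
# All inputs are hypothetical figures a team would collect per release.

def review_metrics(review_hours, impl_hours, kloc, issues_found):
    """Return the basic code-review time/yield ratios."""
    total_hours = review_hours + impl_hours
    return {
        "review_share_pct": 100.0 * review_hours / total_hours,
        "hours_per_kloc": review_hours / kloc,
        "issues_per_review_hour": issues_found / review_hours,
    }
```

For instance, a team that spent 20 hours reviewing against 80 hours implementing, over 10 KLOC, finding 40 issues, spent 20% of its time in review at 2 hours per KLOC and harvested 2 issues per review hour; tracked release over release, those numbers answer the comparison questions directly.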

More Review Collateral

Alan makes the case that there is a lot of communication that goes on in code reviews. Rather than verbalize the majority of it, it would be helpful to capture all of that feedback. In many cases, unless that feedback is captured some way, it’s lost in the ether. Formal code review tools can help with this process, incorporating email comments, or even just marking up the code and being verbose with commentary can prove to be very helpful.

Two Faces of Review

More than just providing the opportunity to review and refactor code, reviews also give everyone on the team a better chance to learn the functionality and underlying calls throughout the system in a comprehensive manner. In short, the biggest benefit of code reviews, beyond finding bugs, is educating the development team and test team as to what the code actually contains.

Tools, Tools, Everywhere

There’s a double-edged sword in having so many software developers and SDETs at Microsoft.

The good news: there’s a huge number of tools to help with the situations developers and testers might face (lots of options and choices for specific methods and implementations).

The bad news: there’s a huge number of tools to help with the situations developers and testers might face (too many choices can make it hard to determine which tool is appropriate for a specific method or implementation). To this end, there are dozens of automation frameworks out in the wild at Microsoft, and each has a different purpose. Knowing which tool to use and when is not a chance encounter; users have to know what they are doing and when.

Reduce, Reuse, Recycle

Alan makes the point that Microsoft Office, before it became Office, was a collection of separate tools (I distinctly remember the days when we would buy Access, Excel, PowerPoint, and Word as separate applications to install on systems). When the decision was made to bundle these applications together, along with Outlook, to create the Microsoft Office suite, it was discovered that many functions were repeated or operated very similarly across the respective applications. Rather than maintain multiple versions of the same code, the decision was made to create a dedicated library of functions that all of the tools would use. Doing this simplified the coupling of the applications to one another and made it possible to deliver a similar look and feel and to conduct transactions in a similar manner between tools.

Additionally, the greater benefit is the fact that, for testers, there is less code to have to wade through and fewer specific test cases required; the shared library allowed for a more efficient use of code modules.

This system works great when code is being shared between product lines. The challenge comes when trying to do the same thing between functional groups looking to create test harnesses and framework. Without an up-front decision to consolidate and work on frameworks together, there isn’t much benefit for functional groups to consult with one another as to how they are each doing their testing and what code they are writing to do that.

What’s the Problem?

Generally speaking, if separate groups are working toward different goals, there may be no benefit at all in standardizing automated tests and frameworks. In other contexts, though, it may prove very helpful, in that it allows organizations to be more efficient and make better use of their respective tools. What’s more, test development may well go faster because all testers are “on the same page” and understand both the benefits and limitations of specific systems.

There is no question that the challenges for testing are going to grow rather than shrink over the coming years. The big question, then, is “what can the testers do to help surf the waves rather than get crushed by them?” Taking advantage of the tools and infrastructure options open to them will go a long way toward making it possible for testers and developers to keep abreast of the furious pace of development. Utilizing tools like virtualization, code reviews, and failure analysis will help testers and developers quickly deploy environments, gain a better understanding of the code being created, and respond more quickly to the errors and failures that are the result of continuous software development.
