Monday, December 13, 2010

BOOK CLUB: How We Test Software At Microsoft (7/16)

This is the seventh installment of the TESTHEAD book club covering How We Test Software at Microsoft. This is a continuation of Section 2, which deals with the philosophy of testing and the testing concepts that Microsoft embraces and teaches its SDEs and SDETs. This section covers the area of Analyzing Risk and Code Complexity Testing, and again, the density and detail provided could take several blog posts to cover. Rather than give a rundown and bit-by-bit summary of the entire chapter, this is a high level view and will naturally only be able to skim the surface.

Chapter 7. Analyzing Risk with Code Complexity

Alan starts out this chapter with a story of his uncle and how he has fished the rivers of Montana during his life, and the vast knowledge he has of the sport. Not only is he an expert with rods, line and bait choices, but also with an encyclopedic knowledge of the rivers and the conditions that would be conducive to catching fish. Alan points out that for all of his uncle's skill, it will go to waste if he fishes in areas where there just aren't likely to be any fish.

When we test we have a similar issue. We may have lots of great tools, lots of great insights, and we may even have code that has bugs by the barrel, but we aren't going to find them if we don't know where to look. Likewise, we will waste time pounding through super effective tests in areas where there just aren't any bugs. Exhaustive testing is not possible in all cases (come on, raise your hands if you have had a developer tell you "just test everything" and you have not stifled a laugh at hearing that; I know it's not just me!), so we have to choose where to spend our effort.

Boundary-value analysis, pairwise testing, and other methods help us to minimize the number of test cases we need to run, yet still provide us with the maximal ability to find bugs. At least that's the idea, and it would work great if we had a code base where bugs were evenly distributed. Alas, that's not the case, and just like Alan's uncle knowing that deeper and calmer areas with submerged logs and water plants are more likely to have fish, certain areas of code are likewise more likely to have bugs than others. The ability to determine where bugs will be and then find them takes a bit of skill and a bit of understanding the code like a river (if I may torture the metaphor just a little longer) so as to anticipate where pockets of bugs will be, and then target our test cases to focus there.

Risky Business

Risk-based testing is specifically when we apply our testing acumen in a way that allows us to mitigate or minimize risks. In all testing there are risks, the biggest being there just isn't enough time or ability to test everything. With a limited pie when it comes to testing time, prioritizing is vital, and just as vital is prioritizing those tests that will provide the maximum amount of coverage and ferret out the most bugs in the shortest amount of time. A famous and oft quoted (and perhaps misquoted) principle is credited to Vilfredo Pareto, who observed that 80% of the wealth of the society belonged to 20% of the people. This has morphed over time into something we refer to in software development as the "Pareto Principle", which generally says that 80% of the results come from 20% of the causes. Some examples:

  • 80 percent of the users will use 20 percent of the functionality
  • 80 percent of the bugs will be in 20 percent of the product
  • 80 percent of the execution time occurs in 20 percent of the code

The Pareto Principle is often used as a basis for risk-based testing. Often the goal is to focus on finding these 20% points to get the 80% results. One aspect of a risk-based approach attempts to classify which portion of the product contains these popular user scenarios and then focuses testing towards those areas of the product. The flip side to this is that there are customers who most certainly will use the features that fall outside of the 20% range, so only focusing on the preferred areas will miss the features that fall outside of the 20%. Another approach is to write more tests for the parts of the product that would be more likely to have bugs. The rub, of course, is making sure that you know where those areas actually are.
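To make the "80 percent of the bugs in 20 percent of the product" idea concrete, here is a minimal sketch that checks how concentrated bugs are in the buggiest modules. The module names and bug counts are made up for illustration; they are not from the book.

```python
def share_of_bugs_in_top_modules(bug_counts, top_fraction=0.2):
    """Return the fraction of all bugs found in the buggiest `top_fraction` of modules."""
    counts = sorted(bug_counts.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Always look at at least one module, even for tiny products.
    top_n = max(1, round(len(counts) * top_fraction))
    return sum(counts[:top_n]) / total


# Hypothetical bug counts per module (illustrative data only).
bug_counts = {
    "net": 40, "ui": 3, "io": 2, "auth": 2, "log": 1,
    "cfg": 1, "cli": 1, "db": 0, "fmt": 0, "util": 0,
}

share = share_of_bugs_in_top_modules(bug_counts)
print(f"Top 20% of modules hold {share:.0%} of the bugs")
```

If the number that comes back is close to an even split rather than lopsided, the Pareto assumption may not hold for that product, and a different way of prioritizing is probably called for.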

A Complex Problem

Alan uses the example of cooking to demonstrate the notion of simple vs. complex and the potential for mistakes when the recipe requires many steps and many interactions. Complex code and the interactions with those pieces are likewise more prone to mistakes. By seeking out areas where the code is more complicated, the odds go up that there will be more bugs found in those sections. As Alan points out, complicated code is also much more difficult to maintain. In addition, the more complicated the code, the more difficult the code will be to test.

So what do we use to tell us if we are looking at a scary section of code? At times we go with our gut feelings, or we review the code to determine whether we are looking at a difficult section. A phrase used by Agile teams is “code smell”, which describes code that might be too complex based on large functions or many dependencies.

Alan shared a story from when he was working with the Windows 95 team and he was testing network components. A number of issues would be marked as “Won’t fix” because the code was deemed to be "too scary to touch", compounded by the fact that the original developer had left some years earlier. Every one of the developers was terrified of making changes and introducing more bugs.

When dealing with issues of simplicity or complexity, there are a few objective measurements that a tester can make to get an idea of what they might be getting into.

Lines of Code (LOC)

Applications with 1,000 lines of code are typically less complex than applications with 10,000 lines of code in them. The difference isn’t just a multiplier, either. There is likely to be way more than 10 times the bugs in 10,000 lines of code as compared to 1,000. The problem with using lines of code as a metric is that there are different ways to count them.

Counting Lines of Code

How many lines of code are in a program? How can such a simple question be so difficult to answer? Let’s start with something simple:

if (x < 0)
    i = 1;
else
    i = 2;

The preceding excerpt contains four lines. How many does it contain if formatted like this?

if (x < 0) i = 1;
else i = 2;

You would have a hard time counting the second example as four lines—it is two lines, or at least it is formatted as two lines. Personally, I don’t like the formatting of either example. Given the same code, I’d prefer to write it like this:

if (x < 0)
{
    i = 1;
}
else
{
    i = 2;
}

So, is this example two, four, or eight lines long? The answer depends on whom you ask.
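A minimal sketch of why the answer depends on whom you ask: the same if/else statement laid out as two, four, and eight physical lines, counted three different ways. The semicolon count is only a crude stand-in for "statements" in C-style code (it ignores things like `for (;;)` headers and semicolons inside strings), and the function name is my own.

```python
def count_lines(source):
    """Count a C-style snippet three ways: physical lines, non-blank lines, statements."""
    lines = source.splitlines()
    return {
        "physical": len(lines),
        "non_blank": sum(1 for line in lines if line.strip()),
        # Crude proxy for logical statements: terminating semicolons.
        "statements": source.count(";"),
    }


TWO_LINE = "if (x < 0) i = 1;\nelse i = 2;\n"
FOUR_LINE = "if (x < 0)\n    i = 1;\nelse\n    i = 2;\n"
EIGHT_LINE = "if (x < 0)\n{\n    i = 1;\n}\nelse\n{\n    i = 2;\n}\n"

for name, snippet in [("two", TWO_LINE), ("four", FOUR_LINE), ("eight", EIGHT_LINE)]:
    print(name, count_lines(snippet))
```

The physical line count changes with formatting, while the statement count stays at two for all three layouts, which is why any LOC figure needs to say which counting rule produced it.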

Measuring Cyclomatic Complexity

The more decisions a program has to make, the more likely an issue will be found. Determining how many decisions there are in a program can be helped by using “cyclomatic complexity”. Cyclomatic complexity looks at the number of decision points in a function and how many paths can be followed based on those decisions. A function with no conditional statements, loops, etc. has just one “linearly independent” path through the program. Conditional statements add optional paths to the workflow and create additional paths to follow.

Cyclomatic complexity becomes an issue as the number of decision points and branches grows. It’s harder to keep many paths straight, and the programmer has the potential to drop the ball in many more places as the cyclomatic complexity rises. Additionally, high numbers of decision points make for code that is more difficult and time consuming to test.
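As a rough illustration of counting decision points, here is a sketch that approximates cyclomatic complexity as one plus the number of branching constructs. It analyzes Python source using the standard `ast` module rather than managed code, and the set of node types counted is my own simplification, so real tools may report different numbers.

```python
import ast

# Constructs treated as decision points in this sketch (a simplification).
DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def cyclomatic_complexity(source):
    """Approximate complexity of a snippet: 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1  # a snippet with no branches has exactly one path
    for node in ast.walk(tree):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds one short-circuit branch per extra operand
            complexity += len(node.values) - 1
    return complexity


STRAIGHT_LINE = "def f(x):\n    return x + 1\n"

BRANCHY = """
def sign(x):
    if x < 0:
        return -1
    elif x == 0:
        return 0
    return 1
"""

print(cyclomatic_complexity(STRAIGHT_LINE))
print(cyclomatic_complexity(BRANCHY))
```

The function with no conditionals has a single path, while the `if`/`elif` version adds one path per decision.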

For managed code development, several free tools are on the market that calculate the cyclomatic complexity of a given piece of code, including SourceMonitor, Reflector, and FxCop.

A primary use of cyclomatic complexity is as a measurement of the testability of a function.

Cyclomatic complexity and associated risk:

  • 1–10: Simple program with little risk
  • 11–20: Moderate complexity and risk
  • 21–50: High complexity and risk
  • 50+: Very high risk/untestable
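The bands above translate directly into a small lookup. This is a sketch only: the table lists 50 under both of its last two rows, so I am assuming a value of exactly 50 belongs to the "high" band and anything above it is "very high".

```python
def risk_band(complexity):
    """Map a cyclomatic complexity value to the risk description in the table above."""
    if complexity < 1:
        raise ValueError("cyclomatic complexity is at least 1")
    if complexity <= 10:
        return "Simple program with little risk"
    if complexity <= 20:
        return "Moderate complexity and risk"
    if complexity <= 50:  # assumption: exactly 50 is still "high"
        return "High complexity and risk"
    return "Very high risk/untestable"


for value in (3, 15, 42, 75):
    print(value, "->", risk_band(value))
```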

Halstead Metrics

Halstead metrics are an entirely different complexity metric based on the following four measurements of syntax elements in a program:

  • Number of unique operators (n1)
  • Number of unique operands (n2)
  • Total occurrences of operators (N1)
  • Total occurrences of operands (N2)

With these values, we can also make a determination of just how much complexity we will be dealing with.

  • Code length is calculated by adding N1 and N2.
  • Difficulty metric is determined using the formula: (n1 / 2) * (N2 / n2)
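Here is a minimal sketch of the two formulas above, applied to the earlier `if (x < 0) i = 1; else i = 2;` example. Deciding what counts as an operator or an operand varies from tool to tool; the split used here (keywords and symbols as operators, names and literals as operands, punctuation ignored) is my own assumption.

```python
def halstead(operators, operands):
    """Compute Halstead length and difficulty from lists of operator and operand tokens."""
    n1 = len(set(operators))  # unique operators
    n2 = len(set(operands))   # unique operands
    N1 = len(operators)       # total operator occurrences
    N2 = len(operands)        # total operand occurrences
    length = N1 + N2
    difficulty = (n1 / 2) * (N2 / n2) if n2 else 0.0
    return {"n1": n1, "n2": n2, "N1": N1, "N2": N2,
            "length": length, "difficulty": difficulty}


# Tokens for: if (x < 0) i = 1; else i = 2;
# (assumed split; parentheses and semicolons are ignored in this sketch)
operators = ["if", "<", "=", "else", "="]
operands = ["x", "0", "i", "1", "i", "2"]

metrics = halstead(operators, operands)
print(metrics)
```

With this token split there are 4 unique operators and 5 unique operands, giving a length of 11 and a difficulty of (4 / 2) * (6 / 5) = 2.4.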

Halstead metrics can help to determine the level of difficulty in maintaining a program, but there are many factors that can influence those figures. Translation: doing some math on the number of operators and operands will give a rough idea, but will not answer everything. They are tools, and it takes a skilled craftsman to use them effectively.

Object-Oriented Metrics

Object-oriented programming has some unique quirks of its own, and as such, there are object-oriented metrics that evaluate class structure in languages where OOP/OOD is used (C++, Java, C#, etc.). One such system is known as CK metrics (named after the two who described them, Chidamber and Kemerer).

CK metrics include the following:

  • Weighted methods per class (WMC): Number of methods in a class
  • Depth of inheritance tree (DIT): Number of classes a class inherits from
  • Coupling between object classes (CBO): Number of times a class uses methods or instance variables from another class
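Two of these can be sketched for Python classes using only the standard `inspect` module. This is an illustration under simplifying assumptions: every method gets a weight of one (matching the "number of methods" description above), depth is taken as the longest chain of ancestors below `object`, and CBO is left out because it needs static analysis of method bodies. The example classes are hypothetical.

```python
import inspect


def weighted_methods_per_class(cls):
    """WMC with unit weights: methods defined directly on the class."""
    return sum(1 for member in vars(cls).values() if inspect.isfunction(member))


def depth_of_inheritance(cls):
    """DIT: longest chain of ancestors, not counting the implicit `object` root."""
    bases = [base for base in cls.__bases__ if base is not object]
    if not bases:
        return 0
    return 1 + max(depth_of_inheritance(base) for base in bases)


# Hypothetical class hierarchy used only to exercise the metrics.
class Shape:
    def area(self):
        return 0.0

    def describe(self):
        return "shape"


class Polygon(Shape):
    def sides(self):
        return 0


class Square(Polygon):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side ** 2

    def sides(self):
        return 4


for cls in (Shape, Polygon, Square):
    print(cls.__name__, weighted_methods_per_class(cls), depth_of_inheritance(cls))
```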

Just as large functions and numerous decision points in structured programming create a system that is hard to wrap one's head around, object-oriented metrics point to classes with lots of methods, deep inheritance trees, or excessive coupling as being more difficult to test and maintain, and likewise more prone to contain defects.

Another measurement used in object-oriented programming is “fan-in” and “fan-out”. This is a measurement of how many classes call into a specific class and how many classes are called from a specific class. Knowing what classes call other classes, and what classes depend on other classes, helps to give the tester an idea of other components that must be tested. In this case especially, change one, and you may be creating instability in a dozen other places.

Fan-in and fan-out measurements can also be used with structured programming or scripted languages. When a function gets called by many other functions, changes to it can cause large stability issues. Alan points to the Windows application API:

Many core Windows functions are called by thousands of applications. Even the most trivial change to any of these functions has the potential to cause one of the calling functions to suddenly fail. Great care must be taken during maintenance of any function or module with a high fan-in measurement. From the testing perspective, this is a great place to be proactive. Determine as early as possible which functions, modules, or classes of your application will have the highest fan-in measurement, and concentrate testing efforts in that area.

Fan-out measurements tell you how many dependencies you have. If you have a function or class that calls 10 or 20 methods, this means there are 10 or 20 different methods where changes might affect you.
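Given a call graph, both measurements are simple counts. The sketch below assumes the call graph has already been extracted into a dictionary of caller to callees; the function names in it are hypothetical.

```python
from collections import defaultdict


def fan_metrics(call_graph):
    """Return fan-in and fan-out per function for a {caller: [callees]} mapping."""
    fan_out = {caller: len(set(callees)) for caller, callees in call_graph.items()}
    fan_in = defaultdict(int)
    for callees in call_graph.values():
        for callee in set(callees):  # count each caller once per callee
            fan_in[callee] += 1
    names = set(call_graph) | set(fan_in)
    return {
        name: {"fan_in": fan_in.get(name, 0), "fan_out": fan_out.get(name, 0)}
        for name in names
    }


# Hypothetical call graph (illustrative only).
call_graph = {
    "open_doc": ["read_file", "log"],
    "save_doc": ["write_file", "log"],
    "export": ["write_file", "log"],
    "read_file": ["log"],
    "write_file": [],
}

result = fan_metrics(call_graph)
for name, counts in sorted(result.items(), key=lambda item: -item[1]["fan_in"]):
    print(name, counts)
```

Here `log` comes out with the highest fan-in, so a change to it could affect four callers, which makes it a sensible place to concentrate testing.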

The Windows Sustained Engineering (SE) Team is responsible for all of the ongoing maintenance of the released versions of Windows. This includes hotfixes, security patches, updates (critical and noncritical), security rollups, feature packs, and service packs.

Whenever a hotfix is released, the SE team must decide how many of the 4,000+ binaries that make up Windows they should test. With a hotfix, they have a very short amount of time to ensure that a change in one binary doesn’t affect some other binary, but they don’t have time to test each and every binary in Windows for every hotfix. They use a combination of complexity metrics to do a risk ranking of each binary based on the changes being made for a particular hotfix as well as the overall historical failure likelihood of a binary.

Once ranked, they take a very conservative approach and eliminate the bottom 30 percent of the least risky binaries from the regression test pass. This means they don’t have to spend time running tests for more than 1,000 binaries and can concentrate their testing on the remaining higher-risk binaries. As they hone their process, they will be able to eliminate even more binaries, increasing the test efficiency while remaining confident that the changes being made don’t cause an undetected regression.

—Koushik Rajaram
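The ranking-and-cutoff step described in that sidebar can be sketched as follows. How the SE team actually computes its risk scores is not described here, so the scores below are simply assumed inputs, and the binary names are hypothetical.

```python
def split_by_risk(risk_scores, skip_percent=30):
    """Rank binaries from riskiest to least risky and set aside the bottom slice."""
    ranked = sorted(risk_scores, key=risk_scores.get, reverse=True)
    skip_count = len(ranked) * skip_percent // 100  # integer math avoids float rounding
    keep_count = len(ranked) - skip_count
    return ranked[:keep_count], ranked[keep_count:]


# Hypothetical risk scores per binary (higher means riskier).
risk_scores = {
    "bin_a.dll": 0.91, "bin_b.dll": 0.15, "bin_c.dll": 0.62, "bin_d.dll": 0.08,
    "bin_e.dll": 0.77, "bin_f.dll": 0.33, "bin_g.dll": 0.05, "bin_h.dll": 0.48,
    "bin_i.dll": 0.56, "bin_j.dll": 0.70,
}

to_test, skipped = split_by_risk(risk_scores)
print("test:", to_test)
print("skip:", skipped)
```

Rounding the cutoff down keeps the approach conservative: when the count does not divide evenly, fewer binaries are skipped rather than more.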

High Cyclomatic Complexity Doesn’t Necessarily Mean "Buggy"

Quantifying complexity does not necessarily equate to an area being particularly buggy, and going in with the idea that high complexity means that lots of bugs will be found is a mistake and may cost the tester many hours of fruitless searching. Alan uses the example of a smoke alarm here. When the alarm goes off, it doesn’t necessarily indicate that there is a fire, but it may very well indicate that there *might* be a fire and that you should investigate (and in the case of a fire alarm, quickly!). Similarly, when looking at complex systems, the code may or may not be buggy, but just like that smoke alarm, you will want to make sure that you take a closer look.

Alan uses the example of switch statements driven by menu interfaces. Even in a relatively simple program like MS Paint, there can be many different branches to investigate just based on the menu option selected. The result is high measured complexity, but code that is fairly straightforward to test because of how it is implemented.

What to Do with Complexity Metrics

It’s entirely possible that we can go through and grok all of these different complexity metrics, plus some others, get a really good idea of the complexity of the code, and have little in the way of bugs to show for our efforts.

How can we help to make sure that we aren’t spinning our wheels, or going overboard at analyzing areas with limited bug potential? One approach is to combine the methods. If several different complexity metrics all classify a function, module, or file as being highly complex, it’s a good bet that this will be an area to explore more or pay closer attention to.  If only one measurement shows a high complexity, but others do not, then weigh that in when considering whether or not the area contains “lots of fish” or is a little more deceptive and may not be as rich as we think.
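One way to picture "combine the methods" is to check how many metrics agree before deciding where to dig. In this sketch the threshold values, the metric names, and the rule that two agreeing metrics are enough are all my own assumptions, not figures from the book.

```python
# Hypothetical thresholds above which a metric is treated as "high".
THRESHOLDS = {"cyclomatic": 20, "halstead_difficulty": 30, "fan_in": 10}


def flagged_metrics(measurements, thresholds=THRESHOLDS):
    """Return the names of the metrics whose value exceeds its threshold."""
    return sorted(
        name for name, value in measurements.items()
        if name in thresholds and value > thresholds[name]
    )


def triage(measurements, thresholds=THRESHOLDS, min_agreeing=2):
    """Suggest a priority based on how many metrics agree that the code is complex."""
    flagged = flagged_metrics(measurements, thresholds)
    if len(flagged) >= min_agreeing:
        return "explore first"
    if flagged:
        return "weigh carefully"  # a single signal may be deceptive
    return "lower priority"


print(triage({"cyclomatic": 35, "halstead_difficulty": 42, "fan_in": 3}))
print(triage({"cyclomatic": 35, "halstead_difficulty": 8, "fan_in": 2}))
print(triage({"cyclomatic": 4, "halstead_difficulty": 6, "fan_in": 1}))
```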

If code is newly developed, high complexity may point to a need for refactoring. Alan points out that some teams at Microsoft experiment with setting limits on the level of cyclomatic complexity allowed for new functionality. Of course, they have also discovered that, by itself, cyclomatic complexity isn’t the only indicator of code being in need of refactoring. Often the code is just fine as is. Using a combination of metrics tends to give the clearest picture.

In the end, code complexity can be both a boon and a bane to testing. It’s a helpful metric at times, but it can tell a misleading story when taken in isolation. High complexity tells you only that the code *might* be more prone to being buggy. It takes a sapient tester to go in and find out for sure.

Chapter 8 will be our next installment. Stay tuned.
