This is the third part of Section 3 in “How We Test Software at Microsoft”. This was a meaty chapter to get through, and as such, it took me longer to finish this one. This chapter focuses on Non-Functional Testing, and how Microsoft approaches a number of these "fuzzier" testing areas... and yes, even "fuzz" testing (more on that later :) ). Note, as in previous chapter reviews, Red Text means that the section in question is verbatim (or almost verbatim) as to what is printed in the actual book (and in this section, to preserve context, there's a lot of it this time around).
Chapter 11: Non-Functional Testing
Alan starts out the chapter with a story that rings true to just about any tester out there. Your team goes along and develops a product that looks to be performing excellently in all areas, and all of the details look to be going smoothly. So smoothly that you are confident on release that this product is going to rock, and it does… for the first couple of weeks of deployment. But then the inevitable calls come in. The application’s performance has degraded to the point where it is unusable. Further research shows that there are a number of memory leaks, and that they would have been found if the platform and application had been tested for an extended period of time, but they hadn’t been. Instead, nightly builds and short bursts of testing were the norm, and this issue slipped through.
So when we see a phrase like “non-functional testing”, what are we talking about? Functional testing focuses on positive and negative testing (the inputs and outputs) for the functionality of the application under test. Non-functional testing focuses on everything else. Performance testing, load testing, security testing, reliability testing… I tend to use the term para-functional testing to describe these types of tests (with a tip of the hat to Cem Kaner and James Bach for introducing the term to me). Alan further describes these tests as those where direct and exact measurement is either impossible or rather difficult.
The International Organization for Standardization (ISO) defines several non-functional attributes in ISO 9126 and in ISO 25000:2005. These attributes include the following:
Reliability: Software users expect their software to run without fault. Reliability is a measure of how well software maintains its functionality in mainstream or unexpected situations. It also sometimes includes the ability of the application to recover from a fault. The feature that enables an application to automatically save the active document periodically could be considered a reliability feature. Reliability is a serious topic at Microsoft and is one of the pillars of the Trustworthy Computing Initiative (http://www.microsoft.com/mscorp/twc).
Usability: Even a program with zero defects will be worthless if the user cannot figure out how to use it. Usability measures how easy it is for users of the software to learn and control the application to accomplish whatever they need to do. Usability studies, customer feedback, and examination of error messages and other dialogues all support usability.
Maintainability: Maintainability describes the effort needed to make changes in software without causing errors. Product code and test code alike must be highly maintainable. Knowledge of the code base by the team members, testability, and complexity all contribute to this attribute. (Testability is discussed in Chapter 4, “A Practical Approach to Test Case Design,” and complexity is discussed in Chapter 7, “Analyzing Risk with Code Complexity.”)
Portability: Microsoft Windows NT 3.1 ran on four different processor families. At that time, portability of code was a major requirement in the Windows division. Even today, Windows code must run on both 32-bit and 64-bit processors. A number of Microsoft products also run on both Windows and Macintosh platforms. Portable code for both products and tests is crucial for many Microsoft organizations.
Testing the “ilities”
Alan focuses on a number of tests that fall into what he refers to as the “ilities” section of testing. Some of the ilities listed are:
Security (or Securability to keep the meme going)
Microsoft has developed specialized teams to focus on many of these “ilities”, specifically in the area of usability, where an entire team is dedicated to running tests and finding ways to innovate the process. Alan focuses on some key areas where he feels Microsoft has led the way or added specific innovations to the process of non-functional testing.
Performance testing is said to encompass lots of different disciplines. Stress, load, and scalability testing are often mentioned in the same breath or are implied to be the same things as performance testing. Indeed, long term testing, stress testing, and load testing are elements that are applied to help ensure that a system is performing optimally.
Some of us might recall using a stopwatch to test how long certain pages take to load or certain transactions take to process (heck, some of us may still do exactly that in some instances today… OK, guilty confession time, I know I certainly do this :) ). It’s not practical at scale, and in general scripts can time these interactions so that they can be recorded on a broader scale.
Performance testing tends to focus on one key attribute... "where are the bottlenecks"? If the flow of information, or the workflow unexpectedly slows down, pinpointing those bottleneck areas and trying to determine what is causing the bottleneck (and fix it) is of primary importance.
How Do You Measure Performance?
Performance issues are often best dealt with early in the life of a product, when tweaks to workflows or performance characteristics are less likely to spill over into other areas. Planning ahead for performance is important to helping make sure that the product meets the performance criteria it was designed to meet.
Some methods to help determine the performance for a project or product:
Ask Questions: Identify areas that have potential performance problems. Ask about network traffic, memory management, database design, or any other areas that are relevant. Even if you don’t have the performance design solution, testers can make a big impact by making other team members think about performance.
Think About The Big Picture: Think about full scenarios rather than individual optimizations. You will have time to dig into granular performance scenarios throughout development, but time during the design is better spent thinking about end-to-end scenarios.
Set Clear, Unambiguous Goals: Goals such as “Response time should be quick” are impossible to measure. Apply SMART (specific, measurable, achievable, relevant, timebound) criteria to the design goals. For example, “Execution of every user action must return application control to the user within 100 milliseconds, or within 10 percent of the previous version, whichever is longer.”
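The sample goal above is specific enough to turn into a check. Here is a minimal sketch, assuming "within 10 percent of the previous version" means the previous version's time plus 10 percent; the function names and default values are my own, for illustration only.

```python
def response_budget_ms(previous_ms, floor_ms=100.0, tolerance=0.10):
    """Return the allowed response time in milliseconds.

    Assumed reading of the goal: the budget is the longer of a fixed
    100 ms floor and the previous version's time plus 10 percent.
    """
    return max(floor_ms, previous_ms * (1.0 + tolerance))


def meets_goal(measured_ms, previous_ms):
    """True if a measured response time is within the budget."""
    return measured_ms <= response_budget_ms(previous_ms)
```

For an action that took 50 ms in the previous version the budget is the 100 ms floor; for one that took 200 ms, the budget grows to roughly 220 ms. Contrast that with “Response time should be quick”, which gives a script nothing to compare against.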
Here are some helpful tips for performance testing.
Establish a Baseline: An important aspect of defining and measuring early is establishing baselines. If performance testing starts late in the project, it is difficult to determine when any discovered performance bottlenecks were introduced.
Run Tests Often: Once you have a baseline, measure as often as possible. Measuring often is a tremendous aid in helping to diagnose exactly which code changes are contributing to performance degradation.
Measure Responsiveness: Users don’t care how long an underlying function takes to execute. What they care about is how responsive the application is. Performance tests should focus on measuring responsiveness to the user, regardless of how long the operation takes.
Measure Performance: It is tempting to mix functionality (or other types of testing) in a performance test suite. Concentrate performance tests on measuring performance.
Take Advantage of Performance Tests: The alternate side of the previous bullet is that performance tests are often useful in other testing situations. Use automated performance tests in other automated test suites whenever possible (for instance, in the stress test suite).
Anticipate Bottlenecks: Target performance tests on areas where latency can occur, such as file and print I/O, memory functions, network operations, or any other areas where unresponsive behavior can occur.
Use Tools: In conjunction with the preceding bullet, use tools that simulate network or I/O latency to determine the performance characteristics of the application under test in adverse situations.
Remember That Resource Utilization is Important: Response time and latency are both key indicators of performance, but don’t forget to monitor the load on CPU, disk or network I/O, and memory during your performance tests. For example, if you are testing a media player, in addition to responsiveness, you might want to monitor network I/O and CPU usage to ensure that the resource usage of the application does not cause adverse behaviors in other applications.
Use “Clean Machines” and Don’t...: Partition your performance testing between clean machines (new installations of the operating systems and application under test) and computer configurations based on customer profiles. Clean machines are useful to generate consistent numbers, but those numbers can be misleading if performance is adversely affected by other applications, add-ins, or other extensions. Running performance tests on the clean machine will generate your best numbers, but running tests on a machine full of software will generate numbers closer to what your customers will see.
Avoid Change: Resist the urge to tweak (or overhaul) your performance tests. The less the tests change, the more accurate the data will be over the long term.
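To make the "establish a baseline, then measure often" tips a bit more concrete, here is a small sketch of what an automated comparison might look like. The scenario names, the use of a median, and the 10 percent tolerance are all assumptions on my part, not anything prescribed by the book.

```python
import statistics
import time


def measure_ms(action, runs=5):
    """Time `action` several times and return the median in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def find_regressions(baseline, current, tolerance=0.10):
    """Return names of scenarios that slowed by more than `tolerance`.

    `baseline` and `current` map scenario name -> median milliseconds.
    Scenarios missing from the baseline are skipped, not flagged.
    """
    slower = []
    for name, now_ms in current.items():
        base_ms = baseline.get(name)
        if base_ms is not None and now_ms > base_ms * (1.0 + tolerance):
            slower.append(name)
    return sorted(slower)
```

Running something like this against every build is what makes it possible to say which change introduced a slowdown. It also shows why the "Avoid Change" tip matters: if the timed action itself keeps changing, the baseline numbers stop being comparable.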
Performance counters are often useful in identifying performance bottlenecks in the system. Performance counters are granular measurements that reveal some performance aspect of the application or system, and they enable monitoring and analysis of these aspects. All versions of the Windows operating system include a tool (Perfmon.exe) for monitoring these performance counters, and Windows includes performance counters for many areas where bottlenecks exist, such as CPU, disk I/O, network I/O, memory statistics, and resource usage.
Many applications have methods that allow the tester to check key performance criteria early and often. These criteria can help the product team determine if they are indeed on the right track. For more information, check The Patterns & Practices Performance Testing Guidance Project and http://msdn.microsoft.com/en-us/library/bb924375.aspx.
The ability of an application to perform under expected and heavy load conditions, as well as the ability to handle increased capacity, is an area that often falls under the umbrella of performance testing. Stress testing is a generic term that often includes load testing, mean time between failure (MTBF) testing, low-resource testing, capacity testing, or repetition testing.
The main differences between the approaches and goals of these different types of testing are described here:
Stress Testing: Generally, the goal of stress testing is to simulate larger-than-expected workloads to expose bugs that occur only under peak load conditions. Stress testing attempts to find the weak points in an application. Memory leaks, race conditions, lock collision between threads or rows in a database, and other synchronization issues are some of the common bugs unearthed by stress testing.
Load Testing: Load testing intends to find out what happens to the system or application under test when peak or even higher than normal levels of activity occur. For example, a load test for a Web service might attempt to simulate thousands of users connecting and using the service at one time. Performance testing typically includes measuring response time under peak expected loads.
Mean Time Between Failure (MTBF) Testing: MTBF testing measures the average amount of time a system or application runs before an error or crash occurs. There are several flavors of this type of test, including mean time to failure (MTTF) and mean time to crash (MTTC). There are technical differences between the terms, but in practice, these are often used interchangeably.
Low-Resource Testing: Low-resource testing determines what happens when the system is low or depleted of a critical resource such as physical memory, hard disk space, or other system-defined resources. It is important, for example, to understand what will happen when an application attempts to save a file to a location that does not have enough storage space available to store the file, or what happens when an attempt to allocate additional memory for an application fails.
Capacity Testing: Closely related to load testing, capacity testing is typically used for server or services testing. The goal of capacity testing is to determine the maximum users a computer or set of computers can support. Capacity models are often built out of capacity testing data so that Operations can plan when to increase system capacity by either adding more resources such as RAM, CPU, and disk or just adding another computer.
Repetition Testing: Repetition testing is a simple, brute force technique of determining the effect of repeating a function or scenario. The essence of this technique is to run a test in a loop until reaching a specified limit or threshold, or until an undesirable action occurs. For example, a particular action might leak 20 bytes of memory. This isn’t enough to cause any problems elsewhere in the application, but if the test runs 2,000 times in a row, the leak grows to 40,000 bytes. If the function provides core functionality that is called often, this test could catch a memory leak that might become noticeable only after the application has run for an extended period of time. There are usually better ways to find memory leaks, but on occasion, this brute force method can be effective.
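The repetition testing arithmetic above (20 bytes leaked per call, 2,000 calls, 40,000 bytes total) can be sketched in a few lines. This is only an illustration using Python's standard tracemalloc module; the leaky and clean functions are hypothetical stand-ins for code under test, and the measured growth will be larger than the raw payload because Python adds its own per-object overhead.

```python
import gc
import tracemalloc

_leaked = []  # stands in for memory the code under test never releases


def leaky_action():
    """Hypothetical function under test that leaks 20 bytes of payload per call."""
    _leaked.append(bytes(20))


def clean_action():
    """Hypothetical function under test that releases everything it allocates."""
    bytes(20)


def repetition_growth(action, repetitions=2000):
    """Run `action` in a loop and return the growth in traced bytes."""
    gc.collect()
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        for _ in range(repetitions):
            action()
        gc.collect()
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before
```

A test could then fail the build whenever the growth for a scenario exceeds some agreed threshold. As the book notes, there are usually better ways to find leaks, so treat this as the brute force option.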
Distributed Stress Testing
Microsoft makes extensive use of the infrastructure in their offices and even individual computers to help with running stress testing. While stress testing may run for 3 to 5 straight days, a lot of stress testing takes place during the time period between when employees leave for the evening and return the next day. Everyone takes part in volunteering their computers to run the overnight stress tests. All groups and all users donate their machines to the cause (the actual application or method varies from group to group and project to project).
The Windows Stress Team
The Windows Stress Team does not have a lab filled with hundreds of computers. Instead, they rely on the numerous users of Windows systems within Microsoft to volunteer their computers to the stress testing effort. Every day, dozens or more stress failures are reported from the thousands of computers that run the stress tests. The Windows Stress team doesn’t “own” the computers running the stress tests, so it’s important that the issues on these computers are debugged quickly so that the owners can use the computers for their daily work.
Attributes of Multiclient Stress Tests
Stress tests written for a large distributed stress system have many of the same quality goals as typical automated tests do, but also have some unique attributes:
Run Infinitely: Stress tests ordinarily run forever, or until signaled to end. The standard implementation procedure for this is for the test to respond efficiently to a WM_CLOSE message so that the server can run tests for varying lengths of time.
Memory Usage: Memory leaks by a test during stress typically manifest as failures in other tests resulting from a lack of resources. Ideally, stress tests do not leak memory. Because tests will run concurrently with several other tests, excessive use of processes, threads, or other system resources is also a practice to avoid.
No Known Failures: On many teams, it is a requirement that all tests are run anywhere from 24 hours to a week on a private computer with no failures before being added to the stress mix. The goal of the nightly stress run is to determine what the application or operating system will do when a variety of actions run in parallel and in varying sequences. If one test causes the same failure to occur on every single stress client computer, no new issues will be found, and an entire night of stress will have been wasted.
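The "run until signaled to end" attribute can be illustrated with a portable stand-in. The sketch below does not handle WM_CLOSE; it uses a threading.Event as an analogous stop signal, and the function names and result counters are my own assumptions.

```python
import threading


def run_stress(action, stop_event, results):
    """Repeat `action` until `stop_event` is set, counting passes and failures."""
    while not stop_event.is_set():
        try:
            action()
            results["passed"] += 1
        except Exception:
            # Record the failure and keep going; a real harness would log details.
            results["failed"] += 1


def stress_for(action, seconds):
    """Run the stress loop on a worker thread for roughly `seconds`, then stop it."""
    stop_event = threading.Event()
    results = {"passed": 0, "failed": 0}
    worker = threading.Thread(target=run_stress, args=(action, stop_event, results))
    worker.start()
    stop_event.wait(timeout=seconds)  # nothing sets the event yet, so this just waits
    stop_event.set()                  # the "please close" signal
    worker.join()
    return results
```

The point of the design is that the test itself has no fixed duration. The controller decides how long the run lasts, which is what lets a stress server schedule the same test for an hour or for a whole night.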
Interoperability and compatibility are big concerns for Microsoft, due to the large number of programs and applications that Microsoft actively supports. Each new release of Windows adds additional functionality, while at the same time looking to support applications designed for previous versions of Windows. Application compatibility comes into play with a lot of Microsoft products (Windows, Office, Visual Studio and Internet Explorer being key examples). Microsoft Internet Explorer must continue to support relevant plug-ins or other add-on capabilities.
Microsoft Application Verifier is a key tool to help Microsoft determine if applications are behaving appropriately. Application Verifier can help determine if the system is incorrectly checking for the Windows version, trying to assume administrative rights, or making any of dozens of other programming errors. Memory leaks, memory corruption, and invalid handle usage can also be spotted with the Application Verifier.
Eating Our Dogfood
I mean, we talk at Microsoft about this notion we call “eating our own dog food.” You’re supposed to eat your own dog food before you serve it to anybody else.
—Steve Ballmer, October 21, 2003, Office System Launch
This is a phrase that I grew to appreciate and understand when I worked at Cisco Systems, as we often had engineers and testers reside primarily on an Alpha based network that was often termed the “dogfood” network, where the latest and greatest (or not so great at times) code would be implemented. We’d feel the pain points before our customers did, and therefore we would often resolve many issues before our customers would be using the same code.
Microsoft follows the same philosophy, in that for many of its products, their developers and testers are also the first users. Microsoft eats its own dog food. Everyone on the Windows team uses daily or weekly builds of the in-development version of the operating system as their working O.S. The Visual Studio team develops their product using Visual Studio. The Office team uses the latest builds of Office to write specifications, deliver presentations, send and receive e-mail, etc. Alan also relates that, when he was part of the Windows CE team, he would use the dogfood versions of code to run his office phone, cell phone, home wireless router, etc.
So what would be the disadvantage of this? In some ways, the development engineers would see and feel the pain points that the customers would, but that’s assuming that the engineers would use the product in the same way that the intended customers would, and that’s not always the case. What’s more, engineers working on earlier versions of code will have a tolerance for oddities, conscious or not, that a regular customer would not, or they may use more advanced features on a regular basis while not focusing on more standard features that would mean more to a particular customer.
Often this problem can be overcome by having beta testers use the code and work through their own scenarios and processes. Engineers may focus on one particular type of document and have no exposure to (or even any idea about) how some users might intend to use the product. Microsoft also utilizes individuals in the company that fulfill a diverse range of job functions to help see the product from that particular domain’s point of view (accountants, lawyers, operations, shipping & receiving, etc). By having these diverse groups within the company, as well as outside the company, Microsoft can see how their early versions and later release candidate versions perform under real world scenarios and does not rely exclusively on the engineering teams and their own patterns of use.
Accessibility is about removing barriers and providing the benefits of technology for everyone.
— Steve Ballmer
Giving everyone access to the same information and tools to complete their work is the key goal of accessibility testing. Microsoft has an ongoing effort to help ensure that those with special needs and challenges with accessing computers (sight impaired, hearing impaired, or other disabilities) also have the ability to do many if not all of the tasks that a regular user would be able to accomplish.
The United States federal government is one of Microsoft’s largest customers, and it requires that their information technologies take into account the needs of all users. In 1998, Section 508 (http://www.section508.gov) of the Rehabilitation Act focused efforts to create opportunities for people with disabilities. To this end, Microsoft supports Section 508.
Below are some examples of how Microsoft looks at accessibility issues related to its products:
Operating System Settings: Operating system settings include settings such as large fonts, high dots per inch (DPI), high-contrast themes, cursor blink rate, StickyKeys, FilterKeys, MouseKeys, SerialKeys, ToggleKeys, screen resolution, custom mouse settings, and input from on-screen keyboards.
“Built-in” Accessibility Features: Built-in features include features and functionality such as tab order, hotkeys, and shortcut keys.
Programmatic Access: Programmatic access includes implementation of Microsoft Active Accessibility (MSAA) or any related object model that enables accessibility features.
Accessible Technology Tools: Testing of applications using accessibility tools such as screen readers, magnifiers, speech recognition, or other input programs is an important aspect of accessibility testing. Microsoft maintains an accessibility lab, open to all employees, filled with computers installed with accessibility software such as screen readers and Braille readers.
One of the tools that Microsoft uses to focus attention on disability issues is the creation and use of “Accessibility Personas”. These personas are users created to represent various customer segments and the ways those customers would use the product(s). These personas often span all of Microsoft’s product lines. Personas were created for users who were blind, and these personas focused on the abilities of screen readers and determined areas where there were shortcomings (text embedded in images, for example). The persona for deaf or hard of hearing users helped focus efforts on sounds, their volume range, and the use of other types of alerts where the sounds wouldn’t be effective.
Alan shares a story about how his own preference for using keyboard shortcuts rather than mouse movements focused attention on a product, and how his efforts were being resisted by the project lead. Rather than just keep focusing on the issue and arguing about it to no resolution, he and his team decided to try a different approach. They went in and disconnected the mouse from the team lead’s system and explained that he could have the mouse back after he lived for a while with the usability issues surrounding the keyboard. Needless to say, after that experience (fortunately the lead had a good enough sense of humor to go with it and see the issues for himself) they reached an understanding of the importance of many of the keyboard shortcut issues and resolved them.
Testing for Accessibility
Some of the methods that were devised by creating user personas have helped inform product development across all of Microsoft’s product lines. Here are some examples:
Respect System-Wide Accessibility Settings: Verify that the application does not use any custom settings for window colors, text sizes, or other elements customizable by global accessibility settings.
Support High-Contrast Mode: Verify that the application can be used in high-contrast mode.
Realize Size Matters: Fixed font sizes or small mouse targets are both potential accessibility issues.
Note Audio Features: If an application uses audio to signal an event (for example, that a new e-mail message has arrived), the application should also allow a nonaudio notification such as a cursor change. If the application includes an audio tutorial or video presentation, also provide a text transcription.
Enable Programmatic Access to UI Elements and Text: Although this sounds like a testability feature (enable automation tools), programmatic access through Active Accessibility or the .NET UIAutomation class is the primary way that screen readers and similar accessibility features work.
Testing Tools for Microsoft Active Accessibility
The Active Accessibility software development kit (SDK) includes several worthwhile tools for testing accessibility in an application, particularly applications or controls that implement Microsoft Active Accessibility (MSAA).
With the Accessible Explorer program, you can examine the IAccessible properties of objects and observe relationships between different controls.
With the Accessible Event Watcher (AccEvent) tool, developers and testers can validate that the user interface (UI) elements of an application raise proper Active Accessibility events when the UI changes.
With the Inspect Objects tool, developers and testers can examine the IAccessible property values of the UI items of an application and navigate to other objects.
The MsaaVerify tool verifies whether properties and methods of a control’s IAccessible interface meet the guidelines outlined in the specification for MSAA. MsaaVerify is available in binary and source code forms at CodePlex (http://www.codeplex.com). Whether you are satisfying government regulations or just trying to make your software appeal to more users, accessibility testing is crucial.
Usability and accessibility are often seen as the same thing, but they are really two different disciplines. Accessibility focuses on enabling as many people as possible to access and use the User Interface. Usability focuses on the understandability of, and interaction with, the User Interface. Documentation that is helpful and direct, tool tips, and features that are well laid out and that, to borrow Steve Krug’s famous book title, answer the user’s ultimate plea of “Don’t Make Me Think” are all important to developing software that is truly usable.
The phrase “Don’t Make Me Think” is not meant to be insulting. In fact, it’s become one of my catch phrases when I speak to developers about a feature and the way it’s laid out on the screen. The more I have to ferret around the application to figure out how to use it, the less likely I am to want to use it. In short, I don’t mind having to think about the work I’m doing and the tasks that need to be accomplished. At the same time, I don’t want to fight the application to figure out how to do those steps. My goal as an end user is to accomplish the tasks, and the fewer roadblocks the application throws up in that process, the better.
One of the methods that Microsoft uses to help them determine usability issues is the frequent use of Usability Labs. These labs go outside of the usual testers to get other people involved in using a feature and to “find the pain points” for those users. Seeing those areas can help the development and project managers better understand the system and the areas they may not have considered as strongly, and it can also provide valuable clues to the testers as to areas that they can test more effectively, letting them add usability criteria and usage patterns to various test cases.
Common questions are approached in these studies, such as:
What are the users’ needs?
What design will solve the users’ problems?
What tasks will users need to perform, and how well are users able to solve them?
How do users learn, and then retain their skills with the software?
Is the software fun to use?
Security testing is a huge topic, and several books have been written about it. Suffice it to say that Microsoft has such a large presence in so many networks and businesses that their Operating Systems are frequent targets for hackers trying to get around the security that is in place. To that end, Microsoft places a strong emphasis on security testing. Malicious software, Trojans, adware and other types of malware have become increasingly prevalent over the last several years. Testers aren’t just looking for security bugs; they are also trying to determine how a security bug could be used by a hacker.
A threat model is developed to show the development and design team all of the potential places where an application or system can be attacked, and then to focus and prioritize efforts to either mitigate or, if possible, eliminate entirely those potential attack points, based on probability and potential harm. Good threat modeling requires skills in analysis and investigation, two skills that make testers an important part of the process. Microsoft has even published a book specific to this topic: Threat Modeling by Frank Swiderski and Window Snyder (Microsoft Press, 2004).
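As a rough illustration of prioritizing "based on probability and potential harm", here is a tiny sketch that ranks threats by a simple score. The scoring rule (probability multiplied by harm), the scales, and the class itself are my own simplifying assumptions; they are not taken from the book or from Microsoft's threat modeling process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threat:
    name: str
    probability: float  # estimated likelihood of attack, 0.0 to 1.0
    harm: int           # estimated damage if it succeeds, 1 (minor) to 10 (severe)

    @property
    def risk(self) -> float:
        # Assumed scoring rule: likelihood multiplied by harm.
        return self.probability * self.harm


def prioritize(threats):
    """Return threats ordered from highest to lowest risk score."""
    return sorted(threats, key=lambda t: t.risk, reverse=True)
```

Even a crude ranking like this makes the trade-off visible: a severe but very unlikely attack can end up below a moderate one that is easy to carry out, which is exactly the kind of discussion a threat model is meant to prompt.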
At the end of the day, the customer doesn’t really care how many bugs were found and fixed. What they care about is the way the product looks, feels and performs FOR THEM. All of the metrics and burn down charts will not change the experience of the user if that user’s experience is not part of the testing process. Lots of testing areas don’t fall into the realm of “enter this and expect to see this” that is the key element of functional testing. The “ilities” play a large part in the user experience, and if we get those wrong, even the best designed and most useful functionality won’t matter because it will be seen as too painful to use.
Customers want to know that they can get in, use their software to complete the work they need to do (or have fun with it, or meet an objective) and know that they will be able to do so reliably, securely and with a minimum of having to learn to jump through hoops to make the software do what they want it to.
Many of these areas need to be designed up front to meet these criteria, but without customer interaction, it’s hard to get the necessary feedback. Fortunately, using techniques like Usability labs, Dogfood users, and testing personas, many of these attributes can be addressed during the testing cycle, hopefully before the intended customers see them and voice their pain.