Desperately Seeking System Tests

Thu Dec 09, 2010

In one of my former lives as a QA manager, one of the problems that continually grated on my nerves was the mysterious nature of our nightly regression test failures. Our test suite was incredibly fragile: when analyzing failures during our convergence period to determine whether they truly indicated a potential customer problem or were just whiny tests, I was constantly forced to make decisions based on vague error information emitted by black-box tests of unknown origin. A lot of manual analysis was required for every release.

In my quest for a better way, I read xUnit Test Patterns by Gerard Meszaros, and holy cow, was I a convert. While the practices of unit testing and test-driven development were conceived to improve code development, I liked some of the ideas so much that I wanted to see whether they could be applied at the system-testing level.

The book is huge and not simple to summarize (an early version is available for free online), but in the end my epiphany came down to one simple idea:

The purpose of testing is to give you feedback about your product. The more effectively a test can do that, the better it performs its job.

What do I mean by an effective test? The test is constructed such that it gives focused and meaningful feedback to the developer about the undesirable behavior it has detected. A few traits that make a test more effective are:

  • Is expressively written and named
  • Behaves expressively at runtime
  • Tests only one thing at a time, ideally even one part of the code
  • Is self-checking

For comparison, typical test names, which give me the shivers, are a.stp, a1.stp, a2.stp, bigfile.stp, bigfile2.stp, reallybigfile3.stp. And they fail with errors like "error code 3", "error code 127", … How about some English, for crying out loud? How about not making me memorize bizarre sequences of alphanumerics to know why our product is failing? The sketch below shows what the expressive alternative might look like.
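
To make that concrete, here is a minimal sketch of an expressive, self-checking test. It is purely illustrative: the test name, the fixture file, and the count_torus_surfaces stand-in are all hypothetical, not our actual reader API.

    import unittest

    # Stand-in for the real translator call, stubbed so the example runs;
    # an actual test would invoke the reader under test here.
    def count_torus_surfaces(path):
        fixtures = {"torus_single.stp": 1}  # pretend translation result
        return fixtures[path]

    class TestTorusSurfaceImport(unittest.TestCase):
        """One behavior per test, named in plain English."""

        def test_single_torus_surface_is_recognized(self):
            found = count_torus_surfaces("torus_single.stp")
            # Self-checking: the failure message explains the problem
            # instead of emitting "error code 3".
            self.assertEqual(
                found, 1,
                f"expected exactly 1 torus surface in torus_single.stp, found {found}")

    if __name__ == "__main__":
        unittest.main()

When a test like this fails, the name and the message together tell you what broke, with no manual decoding required.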

There are other ideas I've forgotten, but I think you get the idea: tests should ideally all be expressive and single-minded, so you reduce debug time and can understand your coverage and behavior at a glance. Tests that aren't written this way are much less valuable and should be avoided as much as possible. This seems so logical to me that I think it holds true for both unit- and system-level tests.

What have we found to be the reality? Sadly, in applying these ideas, we’ve found that such an approach is totally insufficient for building a comprehensive test plan.

For example, let's say that in the 3D InterOp product line you want to enhance a reverse-engineered reader to extract a particular type of surface for the first time.

According to my interpretation of test-driven (or requirements-driven) development, one would spend time thoroughly analyzing the input file format to learn how this type of surface is stored. One would also spend time with the source CAD system generating a variety of instances of this particular surface type, so as to exercise both the format and the reader. This sounds pretty thorough.

Next step, just to make sure we're safe, which of course we are because we wrote awesome CAD translation tests, we gather every customer file we can find, old, new, large, and small, to "industrialize." I will refer to this as our "monolithic, mysterious test suite."

Then what happens? Sigh. We discover that we’ve only … just … begun. Sadly, our designed tests gave us almost no indication of how the code would behave in a realistic customer situation.

Bummer, I like tests with good names.

So what do we do now? Do we need to go back to our old, fragile ways? Not exactly. We've made two big changes:

  1. We still write design tests, but we do not rely on them for product validation; they are used mainly by developers to confirm that the code does what they expected.
  2. We now write tools that automatically analyze our run-of-the-mill industrial test suites so that we can understand their composition and begin to classify tests more specifically (a sketch of what such a tool might look like follows this list). This has been an incredibly positive step toward better understanding our coverage and requirements, and toward prioritizing development.
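
Our actual tooling is more involved than I can show here, so take this as only a guess at the flavor: a minimal sketch, assuming the industrial suite is a directory of STEP (.stp) files, that tallies which surface entity types each file exercises. The entity list and the classification scheme are my own illustrative choices, not our real classifier.

    import collections
    import pathlib
    import re

    # STEP part-21 entity instances look like "#12 = TOROIDAL_SURFACE(...)".
    ENTITY_RE = re.compile(r"#\d+\s*=\s*([A-Z0-9_]+)\s*\(")

    SURFACE_TYPES = {
        "B_SPLINE_SURFACE_WITH_KNOTS", "TOROIDAL_SURFACE",
        "CYLINDRICAL_SURFACE", "CONICAL_SURFACE", "SPHERICAL_SURFACE", "PLANE",
    }

    def classify(path):
        """Tally which surface entities a single STEP file exercises."""
        counts = collections.Counter()
        for line in path.read_text(errors="ignore").splitlines():
            match = ENTITY_RE.match(line.strip())
            if match and match.group(1) in SURFACE_TYPES:
                counts[match.group(1)] += 1
        return counts

    def survey(suite_dir):
        """Summarize suite composition: files covering each surface type."""
        coverage = collections.Counter()
        for stp in pathlib.Path(suite_dir).glob("*.stp"):
            for entity in classify(stp):
                coverage[entity] += 1
        return coverage

    if __name__ == "__main__":
        for entity, n_files in survey("industrial_suite").most_common():
            print(f"{entity}: covered by {n_files} file(s)")

Even a crude census like this turns a pile of mystery files into something you can reason about: if only two files in a thousand contain the surface type you just added, you know exactly where your coverage gap is.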

I don’t even use the phrase "monolithic, mysterious test suite" around these test suites that much anymore. I grudgingly admit that they’re valuable, not because they’re expressive (they’re still not) but because they contain cases that we didn’t think of, lots and lots and lots of them – many more than we could ever write ourselves.
