Why slaying mutants is good for measuring the quality of your tests.
Test automation is a well-developed practice these days, but assessing the quality of those tests is not easy. One really big mistake is to rely on coverage alone: you can have 100% coverage (or close to it) and still have really low test quality.
Say you have a beautiful function `foo` defined as:
|
|
And your test looks like:
|
|
If you run the test with your favourite test runner, it will pass (obviously, because we passed `true` to the `assert` call). You get the green light. Even sadder, you also get 100% coverage for this function because we called it in the test, even though we don't use the result of that call to determine whether the test passes.
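The listings for this example did not survive in this copy of the article, so here is a stand-in sketch in Python. The body of `foo` and the name `test_foo` are invented for illustration; only the shape follows the text: a test that calls the function but asserts a constant.

```python
def foo(a, b):
    # Hypothetical body: the original definition of foo is not shown.
    return a + b


def test_foo():
    # The function is called, so every line of foo counts as covered...
    foo(1, 2)
    # ...but the assertion ignores the result, so this test cannot fail
    # as long as foo returns without raising.
    assert True


test_foo()
```

You could replace the body of `foo` with almost anything that does not raise, and this test would stay green.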
Of course this is a contrived example, meant to emphasize my point above: coverage is not a reliable metric for test quality. (At the bottom of the article I link to a kata revolving around exactly this idea of false coverage, and about fixing bad tests.)
One way to get more confidence in your test suite is to use mutation testing. It works by changing a tiny bit of your code, say replacing `==` with `!=` in one place, and then running your test suite against the modified version.
This new version of your code is called a mutant. If all your tests pass on the mutant, it means your test quality is not good enough, because the tests did not catch this change. In this case the mutant is sometimes called a zombie because it lives (it did not get "killed" by the test suite).
Then you repeat the process, introducing another mutation into the original code to create another mutant (say this time you replace `let c = new Foo()` with `let c = null`), then you run the test suite again and determine whether the mutant has been killed.
Rinse and repeat a lot, then count how many mutants you killed relative to how many were produced. This ratio is called the mutation score, and it should be as close to 1 as possible. Mutation testing lets you be more confident in the tests you write.
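To make the loop concrete, here is a small hand-rolled sketch in Python. Everything in it (the function `is_adult`, the three mutants, the two test suites) is invented for illustration; a real tool generates the mutants for you.

```python
def is_adult(age):
    # Original code under test (hypothetical example).
    return age >= 18


# Hand-written mutants: each differs from the original by one small change.
mutants = [
    lambda age: age > 18,   # >= replaced by >
    lambda age: age < 18,   # >= replaced by <
    lambda age: age >= 19,  # constant 18 replaced by 19
]


def weak_suite(fn):
    """Return True if every check passes for the given implementation."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    """Same checks, plus the boundary value."""
    return weak_suite(fn) and fn(18) is True


def mutation_score(suite):
    # A mutant is killed when at least one check fails on it.
    killed = sum(1 for mutant in mutants if not suite(mutant))
    return killed / len(mutants)


# Both suites pass on the original code.
assert weak_suite(is_adult) and strong_suite(is_adult)

print(mutation_score(weak_suite))
print(mutation_score(strong_suite))
```

With the weak suite only the `<` mutant is killed, so two of the three mutants survive. Adding the boundary check at 18 kills all three, which is exactly the kind of missing test a surviving mutant points you to.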
There are two main assumptions behind mutation testing. Let's see what they are:
The first one is the competent programmer hypothesis. It states that most bugs introduced by an experienced programmer into a codebase are small syntactic errors. The second one is the coupling effect hypothesis. It asserts that simple faults can combine into other faults/bugs in an emergent/cascading fashion, so a test suite that catches the simple faults should also catch many of the more complex ones.
The changes in the code (replacing `==` with `!=`, as in the first example) are called mutations, and they are defined by mutation operators. There are many different families of mutation operators.
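As an illustration, a mutation operator can be sketched as a rewrite of the syntax tree. The snippet below uses Python's standard `ast` module; the swap table, the class name `ComparisonMutator` and the function `same` are made up for this example and do not come from any particular mutation testing tool.

```python
import ast

# Hypothetical operator table: each comparison operator and its replacement.
COMPARISON_SWAPS = {
    ast.Eq: ast.NotEq,
    ast.NotEq: ast.Eq,
    ast.Lt: ast.LtE,
    ast.Gt: ast.GtE,
}


class ComparisonMutator(ast.NodeTransformer):
    """Apply the swaps above to every comparison in the tree."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        # Operators missing from the table are recreated unchanged.
        node.ops = [COMPARISON_SWAPS.get(type(op), type(op))() for op in node.ops]
        return node


original_source = "def same(a, b):\n    return a == b\n"

tree = ast.parse(original_source)
mutant_tree = ast.fix_missing_locations(ComparisonMutator().visit(tree))
mutant_source = ast.unparse(mutant_tree)  # needs Python 3.9 or later

# Compile the mutant so that a test suite could be run against it.
namespace = {}
exec(compile(mutant_tree, "<mutant>", "exec"), namespace)
mutant_same = namespace["same"]

print(mutant_source)
```

A real tool would usually mutate one site at a time, so that each mutant differs from the original by a single change; this sketch mutates every comparison at once to stay short.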
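You have operators on (non-exhaustive list):
- comparisons and boolean logic: swapping `==` and `!=`, `||` and `&&`, `<` with `<=`, `>` and `>=`
- arithmetic: swapping `*` with `+`, `-` and `/`
- objects: replacing a value with `null`, changing method and field scope

Of course it's not all good. If it were, we'd have been using it in every project since it was invented (in the 70s!).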
Because we mutate pieces of code, a mutant sometimes causes a crash or an infinite loop (say, when the mutation changes the condition of a loop), and that hinders the tests.
Some mutation testing frameworks deal with that by letting you disable some classes of mutation operators if they do not play well with your codebase.
In most cases you get a timeout for that mutant, so you don't get a clear result, and it can considerably slow down your mutation testing run.
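Here is a rough sketch of how a runner might guard against such mutants, assuming each mutant is executed in its own Python process. The function `run_mutant`, the two-second limit and the looping mutant are all invented for this example.

```python
import subprocess
import sys

# Hypothetical mutant: `n -= 1` was mutated to `n += 1`, so the loop never ends.
LOOPING_MUTANT = """
def countdown(n):
    while n > 0:
        n += 1  # mutated from: n -= 1
    return n

assert countdown(3) == 0
"""


def run_mutant(code, timeout_s=2.0):
    """Run mutant code plus its checks in a child process and classify the outcome."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child is killed after timeout_s seconds: no clear verdict.
        return "timeout"
    # Exit code 0 means every check passed, so the mutant survived.
    return "survived" if result.returncode == 0 else "killed"


outcome = run_mutant(LOOPING_MUTANT)
print(outcome)
```

Every mutant that loops forever costs the full timeout, which is why a handful of them can noticeably stretch a run.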
Another drawback (and, from what I understand, until recently the major one that prevented mutation testing from spreading) is that even a small codebase can generate hundreds or thousands of mutants. Then you need to test every one of them. That's a huge resource sink, one that could only be absorbed by small projects or by medium-to-big projects with huge resources. (Remember, the idea was first devised in the 70s.)
Now we have many more resources at hand, so it is less of a problem, but it can still quickly add time to your test pipeline (especially if your test suite is slow). To help with performance, numerous optimizations have been devised, such as generating mutations only for lines of code that are covered by at least one test, or testing each mutant by running only the tests that cover the mutated line instead of the whole test suite. This makes mutation testing a practical possibility.
Another option is to use extreme mutation (the paper "Will My Tests Tell Me If I Break This Code?" [3] is a good read). Basically, extreme mutation removes all the code from the body of a tested method and replaces it with a single return of a value of the method's return type.
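A minimal sketch of the idea in Python, where the function `discount`, the replacement values and both tests are assumptions made for this example rather than what any specific tool does:

```python
def discount(price, percent):
    # Original code under test (hypothetical example).
    return price - price * percent / 100


def extreme_mutants(values):
    # Each mutant ignores its arguments and returns one fixed value,
    # as if the whole body had been replaced by `return value`.
    return [lambda *args, value=value, **kwargs: value for value in values]


def weak_test(fn):
    # Only checks that the result is not negative.
    return fn(100, 0) >= 0


def strong_test(fn):
    # Checks an actual expected value.
    return fn(200, 50) == 100


mutants = extreme_mutants([0.0, 1.0])

survivors_weak = [m for m in mutants if weak_test(m)]
survivors_strong = [m for m in mutants if strong_test(m)]

print(len(survivors_weak), len(survivors_strong))
```

Both mutants survive the weak test, which tells you the test barely depends on what `discount` computes. The strong test kills both.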
Extreme mutation generates fewer mutants, so it is quicker to test. You can find here a comparison of regular mutation vs extreme mutation testing on various Java libraries with the Pit framework, using a standard generator (called Gregor) and an extreme mutation generator (called Descartes).
Below you can find the result of that comparison:
Project | Descartes time (h:mm:ss) | Descartes mutants | Gregor time (h:mm:ss) | Gregor mutants |
---|---|---|---|---|
authzforce | 0:08:00 | 626 | 1:23:50 | 7296 |
aws-sdk-java | 1:32:23 | 161758 | 6:11:22 | 2141689 |
commons-cli | 0:00:13 | 271 | 0:01:26 | 2560 |
commons-codec | 0:02:02 | 979 | 0:07:57 | 9233 |
commons-collections | 0:01:41 | 3558 | 0:05:41 | 20394 |
commons-io | 0:02:16 | 1164 | 0:12:48 | 8809 |
commons-lang | 0:02:07 | 3872 | 0:21:02 | 30361 |
flink-core | 0:14:04 | 4935 | 2:29:45 | 43619 |
gson | 0:01:08 | 848 | 0:05:34 | 7353 |
imagej-common | 0:08:07 | 1947 | 0:29:09 | 15592 |
jaxen | 0:01:31 | 1252 | 0:24:40 | 12210 |
jfreechart | 0:05:48 | 7210 | 0:41:28 | 89592 |
jgit | 1:30:08 | 7152 | 16:02:03 | 78316 |
joda-time | 0:03:39 | 4525 | 0:16:32 | 31233 |
jopt-simple | 0:00:37 | 412 | 0:01:36 | 2271 |
jsoup | 0:02:43 | 1566 | 0:12:49 | 14054 |
sat4j-core | 0:53:09 | 2304 | 10:55:50 | 17163 |
pdfbox | 0:44:07 | 7559 | 6:20:25 | 79763 |
scifio | 0:24:14 | 3627 | 3:12:11 | 62768 |
spoon | 2:24:55 | 4713 | 56:47:57 | 43916 |
urbanairship | 0:07:25 | 3082 | 0:11:31 | 17345 |
xwiki-rendering | 0:10:56 | 5534 | 2:07:19 | 112605 |
You can see that going from extreme mutation to standard mutation testing costs roughly an order of magnitude more for most projects, both in the number of mutants generated and in the time taken to test every mutant.
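If you want to check that claim on a few rows, the ratios are easy to compute. The snippet below copies three rows from the table (the smallest, a mid-sized one and the largest by Gregor run time); the helper `seconds` is just a convenience written for this example.

```python
def seconds(hms):
    """Convert an h:mm:ss string into a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


# (Descartes time, Descartes mutants, Gregor time, Gregor mutants), from the table.
rows = {
    "commons-cli": ("0:00:13", 271, "0:01:26", 2560),
    "jgit": ("1:30:08", 7152, "16:02:03", 78316),
    "spoon": ("2:24:55", 4713, "56:47:57", 43916),
}

ratios = {}
for project, (d_time, d_mutants, g_time, g_mutants) in rows.items():
    mutant_ratio = g_mutants / d_mutants
    time_ratio = seconds(g_time) / seconds(d_time)
    ratios[project] = (mutant_ratio, time_ratio)
    print(f"{project}: {mutant_ratio:.1f}x mutants, {time_ratio:.1f}x time")
```

For these three projects Gregor produces roughly 9 to 11 times more mutants, and the run takes roughly 7 to 24 times longer.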
Try it sometime. For me it was really fun and eye-opening to catch bad tests I wrote!
And remember:
Software testing proves the existence of bugs not their absence.
1. Parsai, A., Demeyer, S., De Busser, S.: C++11/14 Mutation Operators Based on Common Fault Patterns (2004)
2. Dadeau, F., Héam, P-C., Kheddam, R.: Mutation-Based Test Generation from Security Protocols in HLPSL (2011) in: 2011 Fourth IEEE International Conference on Software Testing, Verification and Validation
3. Niedermayr, R., Jurgens, E., Wagner, S.: Will My Tests Tell Me If I Break This Code? (2016) in: Proceedings of the International Workshop on Continuous Software Evolution and Delivery (CSED '16)