Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2009/12/29/problems_with_tdd

Problems with TDD

If you have not yet read it, please read Maria Siniaalto's 15 page "Test-Driven Development: empirical body of evidence." It summarizes the few empirical studies done to evaluate the effectiveness of TDD. In the conclusion you'll find:

Based on the findings of the existing studies, it can be concluded that TDD seems to improve software quality, especially when employed in an industrial context. The findings were not so obvious in the semi-industrial or academic context, but none of those studies reported on decreased quality either. The productivity effects of TDD were not very obvious, and the results vary regardless of the context of the study. However, there were indications that TDD does not necessarily decrease the developer productivity or extend the project lead-times: In some cases, significant productivity improvements were achieved with TDD while only two out of thirteen studies reported on decreased productivity. However, in both of those studies the quality was improved.

The empirical evidence on the practical use of TDD and its impacts on software development are still quite limited.

I mention this first because I've concluded that not only is TDD not useful for me but I don't think it's a generally useful technique. The important requirements are to have good, complete automated unit tests, to develop code for testing, and to do interative improvement through refectoring and rewriting. TDD promotes those, but my experience is that TDD pins down the code too early and my observation is that TDD by itself ignores certain classes of essential unit tests.

My position against TDD will be contentious to some, like those who believe that TDD is a required component in modern best-practices development. I quoted Siniaalto to show that there is no strong evidence to back that belief. I fully expect someone to tell me that TDD drastically improved their development style. My response will be they learned good practices, but those practices don't require TDD and can as easily be learned without TDD.

By the way, while my conclusion is in opposition to Siniaalto's, it's because the most successful TDD paper in her report comes from Maximilien and Williams about their experience at IBM. They went from ad hoc unit testing to good development practices based on TDD. I think good testing practices without using TDD would have given the same results.

Before going further I'll also quote from Kent Beck's "Test-driven development: by example":

One of the ironies of TDD is that it isn't a testing technique (the Cunningham Koan). It's an analysis technique, a design technique, really a technique for structuring all the activities of development.
This entire essay will describe why TDD is a weak testing technique and an incomplete development technique. I'll bring up other techniques which are not part of TDD but end up leading to better unit tests that should help make you more confident that your code works.

Test first vs. test last vs. good testing

By TDD I mean Test Driven Development, and specifically its test first approach. Wikipedia describes TDD as:

First the developer writes a failing automated test case that defines a desired improvement or new function, then produces code to pass that test and finally refactors the new code to acceptable standards.

By contrast, people also talk about "test last". Test last is the extreme opposite of "test first". One good definition of test last is:
testing should be done before the code goes into production; it does not imply that the tests are automated.

When I say that people shouldn't do TDD I do not mean they should do test last development instead. That is false dichotomy, and it annoys me when I read descriptions which present those two styles as the only possibilities.

My own practice is to have good, automated tests, but these don't get put into place until the cost/benefit ratio makes the tests worthwhile; which is rarely at the start of the code development and always by the end. The test themselves are guided by the code, and the knowledge of where the failure cases might be in the code. In addition, I'll add tests which check the expected input range, and after the code is done I'll add tests which check my belief that the code is done, as well in some cases tests driven by code coverage or other reasons.

I expect people to point out that TDD does not preclude other testing strategies, to fill in those gaps. I completely agree. I agree so much that I mostly use those other good strategies, and not TDD. TDD seems to add little to the result.

Worked out TDD examples

I want to base my response in at least the spirit of empirical research. I can't, because I don't (and neither likely do you) have the resources to do those tests. What I can do is find some descriptions of TDD used to implement a problem and make comments about them to highlight limitations in TDD.

I give full props to those who have described the steps they go through to work on a problem. Even in the simplest of cases it's a lot of work.

I found number of basic TDD tutorials, based around addition and subtraction, either with basic add() and sub() functions or through depositing and withdrawing money from a bank account [1] and [2].

Those were too simple to have problems. I wanted something more complex. The most complete examples I found were Robert Martin's Prime Factors Kata, which he also works through in a video, and implementing the Fibonacci sequence in Gary Bernhardt's blog post How I started TDD and Kent Beck's "Test-driven development: by example". I don't know if Bernhardt's example is derived from Beck's, but it's the one I came across first.

Prime Factors

The Prime Factor Kata asks for a function which takes a number and returns its prime factors in an ordered list, including duplicates. For example, 12 would return 2, 2, 3. The test cases were 1, 2, 3, 4, 6, 8, and 9 and the kernel of the solution was:

  public static List<Integer> generate(int n) {
    List<Integer> primes = new ArrayList<Integer>();

    for (int candidate = 2; n > 1; candidate++)
      for (; n%candidate == 0; n/=candidate)
        primes.add(candidate);

    return primes;

Fibonacci

The Fibonacci sequence examples checked that the first few outputs were correct, giving fib(i=0, 1, ...) = 0, 1, 1, 2, 3, 5 . Both people ended with variations of the classic recursive solution, here from Bernhardt:

def fib(n):
    if n <= 1:
        return n
    else:
        return fib(n - 1) + fib(n - 2)

Problem: TDD doesn't emphasize good test cases

When I looked at Martin's Prime Number Sieve, I first thought the code was wrong. It tests to see if 2 is a divisor, then 3, then 4, then 5, and so on. 4 can never be a prime divisor of the candidate because 4 isn't prime. Why does his code check for that possibility? Was there a bug?

Code should be readable, so that others can understand it and verify that it works. In the same vein, tests should serve as a way for others to check that the code is working. I looked at the tests, and noticed that the only prime factors tested were 2 and 3. Perhaps if 5 was a prime factor then there would be a problem when the code got to 4?

I couldn't tell from the tests, so I had to look more closely at the code. It then became obvious. All factors of 2 were removed, so there was no way that 4 could be a divisor. By construction, no non-prime candidate could ever work, so will never be added to the list.

The tests were not good enough to minimize doubt that the code contained bugs. I can think of a couple of simple variations of the code which would contain bugs and which would pass the tests. Yes, the tests were enough to help Martin get to a solution, but they shouldn't have been enough to convince him, much less others, that the code was right.

Some good tests might have included the primes 17 and 97 as well as 91 (=7*13). I can't think of simple bugs to put in Martin's code which would also cause those test cases to fail, excepting a hard-coded upper limit to the search space which would easily show up on code review.

Fibonacci Sequence

Bernhardt's Fibonacci Sequence did test enough numbers that I was pretty sure that algorithm would come up with the correct answers, although I would have preferred some larger numbers, like fib(12) = 144. (I picked that one because it's cute that 144=12*12.)

Problem: When do you add tests that should pass?

TDD says to add a failing test then fix the code. What do you do with tests which are expected to pass? For example, suppose I finished the prime factors code but upon review of the tests I have a niggling uncertainty that it handles prime factors greater than 3. I want to add a test case to find the factors of 91.

I asked this of Bernhardt, and he kindly addressed that in his followup essay "The Limits of TDD."

After the tests drove the first fully-functional design out, I'd add exactly the types of tests you describe. These wouldn't fail at first, but that's fine; TDD doesn't preclude such things, they're just outside its scope. What I would do, to make sure the tests were honest, is to intentionally break the code, watch them fail (probably along with several other tests), then unbreak the code. This gives me at least some of the confidence that TDD does - I know that something is actually being tested.
This is a bit different than what I would do. If the code is supposed to work then I don't want to touch the code at all. Instead, I add the test but make sure the test is supposed to fail, perhaps by saying the factors of 91 are 5 and 13. Seeing the failure is a check that I didn't make a stupid mistake in writing the test. Then I fix the test and see that it passes.

Mine is not his more TDD approach, although close. But I want to highlight his comment that "TDD doesn't preclude such things, they're just outside its scope."

That's exactly my point, and notably in disagreement with Beck's statement that TDD is "really a technique for structuring all the activities of development."

Other tests and other development approaches besides TDD are needed for good software development, including approaches which are conceptually quite close to TDD but not part of it. I say that the skills that are needed to detect and add good passing tests can equally be applied to developing good unit tests in the first place.

Only, without extra requirement of coming up with all of the tests first.

Problem: TDD does not consider worst-case scenarios

In "good test cases" I said that TDD doesn't stress the tests needed to convince yourself or others that the code was right, only tests to implement the code you think is right. Here I'll talk about a different sort of unit test that TDD doesn't help with - worst-case scenarios.

Prime Factors Kata

I implemented the Prime Factors Kata on my own. It took me a while too. I implemented the Sieve of Eratosthenes to generate prime factors, and only searched for factors up to sqrt(n). This has been my general approach for this sort of problem since college. I ended up with 29 lines of code, and I couldn't understand how Martin was able to write:

The final algorithm is three lines of code. Interestingly enough there are 40 lines of test code.
(BTW, I counted 15 total LOC in the program and 43 LOC in the test module, or 3 vs. 12 if you only talk about "real" code, vs. import statements, function definitions, lines with only a closing brace, and so on. In either way of counting, it's still less than my 29 lines of code.)

If you listen closely in Martin's video you'll see that he considers his three line solution to be "more elegant" than the Sieve solution. I really didn't understand assertion. His solution is going to be slow for almost all cases. I timed Python implementations of our two algorithms for numbers around 200,000. His was 150* slower than my sieve-based solution, and it gets much worse after that.

If you listen even more closely, I think you'll hear the reason. He introduced the problem by saying his kid was learning about prime factors at school, and Martin wanted a program which could solve the same sort of problem. In that case, the prime factors are small. Few teachers would be so mean as to require their students to find the prime factors of 524,287 by hand.

If the possible input range was only, say 1 to 150, then I could see how Martin's code is elegant. But if the input range is 1 to 2**32 (which is more like I expected), then it's clearly not elegant because finding that 2**31-1 is prime will take about 2**31 modulo tests. Computers are fast, but that's excessive. (BTW, it's also cute that 2**31-1 is both max signed integer and a Mersenne prime.)

In either case, there should be tests for values which represent a worst-case scenario. In this case that would be a prime at the high end of the expected range. His largest test was 9. Mine was 2**31-1.

Fibonacci

There are three problems with the Fibonacci implementations. One is that the classic recursive solution (without memoization) takes exponential time. I implemented the solution iteratively and compared the results. Bernhardt's solution for fib(32) takes about as long as my iterative soluton for fib(100000), and after a minute I gave up computing fib(40) recursively.

Another is that Python's default stack size is 1000 function calls. Doing fib(1500) quickly gives a "RuntimeError: maximum recursion depth exceeded" exception.

The last is in Beck's code. Assuming the recursive solution could compute it in time, fib(48) is larger than 2**32. He uses a Java 32 bit integer, so his code would silently overflow.

Discussion

TDD creates unit tests which are used to develop and refactor code. These tests are only a subset, and not even an essential subset, of the tests needed to check that the code implements the requested feature. You may think you are finished with the code and you pass all the TDD tests, but you still aren't finished with the development process. You still have to do other important unit tests.

I'm certain that Beck and Bernhardt know the limitations of their Fibbonacci implementations. I'm really surprised they didn't mention the problems in their solution. It would have been the perfect place to show that other types of unit tests can't be ignored, and discuss how to fit them into the TDD development process.

I also wish that Martin has been less dismissive of the sieve solution. It's obvious that others have mentioned it to him. He should have responded by pointing out that the solution was overkill for the problem range. I also wish he had included tests for the high end of that range. (I have the idea based on other writings that he's not an algorithms person, so he also might not have been aware of the performance problems in his solution.)

Problem: TDD doesn't give you confidence that the code works

Many TDD advocates bring up confidence as a reason for doing TDD. In his book Beck writes:

Psychological - Having a green bar feels completely different from having a red bar. When the bar is green, you know where you stand. You can refactor from there with confidence.
and others write similiar things.

If your goal is to be confident in your code, then TDD is a weak method for developing those tests of confidence. I've now shown a couple of TDD examples, which were done with TDD principles foremost in mind, but which failed to consider worst-case solutions. You should not be confident that your code works just because your TDD tests pass.

When I write my code, I'm not confident that it works. I'm not even confident that a refactoring works despite passing all of the unit tests. I worry about edge cases I didn't think of, I worry about implementation flaws, I worry about worst-case scenarios.

If I write the tests first, I also worry that I've overfit my code to the tests. This is a problem that happens in statistical modelling. Given any set of data points, I can fit them to a model. The next question is, is the model valid and useful? The way to check is to use them to make predictions, and see how well it matches reality. This in turn means testing the model with data which wasn't used to make the model.

I feel the same way about my code. I start with doubt that my program works, but with confidence that I can develop new tests which should pass if the code is correct. To reduce doubt, I'll write new tests and see if they pass or fail. Passing tests reduces my doubt, failing tests means I need to figure out what happened, and I'm back to more code development.

TDD by itself cannot give you that confidence because it excludes the idea of adding tests which are expected to pass. On the other hand, developing unit tests even if just after the code is written (but long before it's deployed as is done with test-last), guided by knowledge of how the software is implemented and experience in how the code can fail, can give you all the benefits of TDD, plus be able to handle the cases that TDD doesn't handle. TDD is one technique for learning those skills, but it is not an essential technique.

Incorrect claim: TDD leads to 100% coverage

Beck and others write that TDD naturally leads to nearly 100% test coverage. In his book he writes "TDD followed religiously should result in 100% statement coverage." Elsewhere I've seen people write similar things.

That's not true. Yes, under TDD new code should have 100% statement coverage, but what about refactored code? This is especially true if the refactor is more like a rewrite, perhaps to replace an algorithm with a faster version.

If I start with Martin's Prime Factors code and change it to my prime sieve based code, I can think of several ways where part of the refactored code wouldn't be tested. You can easily come up with plenty of other refactorings where part of the new code are not tested.

Yes, people will respond that TDD doesn't mean you can't stop being smart, and you must remember to include those tests, or even to add those tests while refactoring the new code. That's very true. I only point out that refactoring doesn't have the goal of maintaining full statement coverage, and therefore TDD doesn't either.

If you feel that code coverage is needed, above and beyond code inspection and manual methods, then there are tools to help automate coverage tests. The best covered tool I know of is SQLite. Its "veryquick" tests run about 42 thousand tests to get 97.23% coverage of about 66,000 SLOC, with additional tests which get 99.50% statement coverage of the entire code, and 100% coverage of the core. This was an intense and dedicated effort which does not and cannot fall out as a simple consequence of TDD.

Complaint: TDD freezes the API too early

This is my personal complaint. It is not derived from those worked out examples.

My own development style is a mixture of many techniques. When I've tried doing TDD I feel like it locks me down too early. My code in the early stage is very fluid. I'm mostly trying to get a feel for what it's going to look like. At that stage the code isn't meant to even compile, and the only machine it runs on is the model in my head.

This is especially true for cases where I'm trying to come up with a good API to implement the new functionality. My test cases are short programs which would use the API, and I try out different example programs to get a feel for usefulness, ease-of-use, ease-of-implementation and other factors.

If I use TDD here, I don't know what the API is going to look like, so how do I write the tests? I won't know what the API is going to look like until I've had a feel for implementing it but even then the API changes often. If I have tests for the API and the API changes, then there's the extra mental barrier of having to change all the tests for the new API.

Especially bad are the cases when I realized that some function isn't needed and should be deleted. With TDD that would also mean deleting the tests which went along with the function, and it would likely mean I've already spent time debugging the function, now all thrown away.

I've seen that in the code katas we do in the GothPy meetings (the local Gothenburg Python Users Group). Once we have working code with unit tests, I don't want to remove the function, and I start thinking about ways to adapt it, rather than thinking about ways to simplify the overall code base.

XP allows something this as a spike solution, but says that you should expect to throw the implementation away and start anew. I don't.

Once I have a good sketch of how the code is going to be, I often continue by filling in the details. At this point unit tests starts to be useful, but if I'm developing an API I'll write a simple functional test which uses the API, and make it work. It really might be a command-line program or even a __main__ for the current module. This helps give me get more concrete solution and once that's solidified enough code I start developing my automated unit tests.

Since I'm not using TDD, I used code coverage (either manually or through coverage tools) to improve statement coverage, and I use my knowledge of the problem to come up good test cases. The result seems to be no less effective than TDD, plus as a methodology it includes development tests which TDD does not.

Conclusion

Good testing practices help make good code. Automated unit tests, written by the developer and run often during the development stage, is a good testing practice. TDD uses those sorts of tests, but its focus on test-first, with failing test cases that reflect missing code, exclude important tests in the development process.

TDD can easily be modified to handle these other cases, but the result is simply "good unit testing", without the test-first aspect that makes TDD what it is.

Questions or Comments?

This is a contentious topic with a long history and plenty said about it. I think I've contributed something new to it with my commentaries on what should be exemplar TDD-based solutions. I hope you found it interesting if not enlightening or useful. With three nearly complete rewrites, it was by far the hardest essay I've ever written for my site.

If you have any comments or feedback, please do let me know.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB