Coding Heresy Part II: Unit Tests are a Code Smell

Oct 28, 2024

That title has got to be the most controversial thing I’ve ever written! So now that I’ve trolled the entire development community, what do I have to say for myself?

Of all the practices in software, unit testing, especially test-driven development (TDD), is the one that has taken on especial fervor. Famed software intellectual Robert "Uncle Bob" Martin has been preaching the virtue of unit tests in the strongest possible terms for over two decades. He compares it to double-entry bookkeeping in accounting, or the practice of surgeons washing their hands. He claims that no developer who releases code without full unit test coverage can call himself a professional.

This is the beginning of my objection to unit tests. The very fact that they've become entangled with moral virtue causes people to use them ineffectively. The conversation stops being about tradeoffs like "Writing unit tests adds X% to development time, but reduces bug count by Y%", and becomes "We create unit tests regardless of any benefit that comes from them, because it's the right thing to do".

Terminology

Before I go further, let me define my terms, starting with "What is a unit test?". It’s important, because the exact usage isn't terribly consistent among developers. I'm going to go with the top Google result, which leads me to the SmartBear blog:

A unit test is a way of testing a unit - the smallest piece of code that can be logically isolated in a system. In most programming languages, that is a function, a subroutine, a method or property.

In practice, unit tests tend to be organized by class. In a typical project, each class will have an accompanying file of unit tests.

If I have a class that goes

it would have an accompanying file of tests, that look something like
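A minimal sketch of such a pair, in Python. Only `Cat`, its `age` property, and the test name `test_cat_age` come from the text; the `name` field and the second test are assumptions added for illustration.

```python
class Cat:
    """A trivially small class: it only stores what it is given."""

    def __init__(self, name, age):
        self.name = name
        self.age = age


# test_cat.py: the accompanying unit tests (pytest-style functions).
def test_cat_name():
    cat = Cat("Whiskers", 3)
    assert cat.name == "Whiskers"


def test_cat_age():
    cat = Cat("Whiskers", 3)
    assert cat.age == 3
```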

Test Driven Development, then, is a discipline of writing these tests in parallel with the code, in a cycle called 'Red, Green, Refactor'.

You write a test first:

Run it and watch it fail (Red). You then write only enough code to cause the test to pass (Green).

Now that you have a passing test, you can take a refactoring step, where you rearrange the code to make it more readable, but in a way that keeps all the tests passing. Finally, you write another failing test, and begin the cycle again.
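One turn of that cycle might look like the sketch below. The `is_kitten` method is hypothetical, invented here only to have some behaviour to drive out with tests.

```python
# Step 1 (Red): write the test before the behaviour exists.
# Run at this point, it fails, because Cat has no is_kitten() yet.
def test_young_cat_is_kitten():
    assert Cat(age=0.5).is_kitten()


# Step 2 (Green): write only enough code to make the test pass.
class Cat:
    def __init__(self, age):
        self.age = age

    def is_kitten(self):
        return self.age < 1


# Step 3 (Refactor): tidy up while the tests stay green, then go back to
# Step 1 with the next failing test.
def test_adult_cat_is_not_kitten():
    assert not Cat(age=3).is_kitten()
```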

Larger automated tests that exercise several modules acting together are variously called "functional", "system", or "integration" tests[1]. These terms can have precise meanings which differ from organization to organization. For this article, I'm sticking with "integration test" as a catch-all for "larger test which exercises several modules acting in concert".

There's another category of even larger tests, which try to test the entirety of your software from the UI to the backend, including integrated services. These take the form of a script in Selenium or a similar tool, which acts by simulating mouse clicks and keyboard strokes upon the UI. I refer to these as “end-to-end tests”.

So back to the main question

We need to clear aside the trappings of moral virtue and ask: "What benefit do we expect to derive from unit tests?"

Let's give the mic to Uncle Bob, on the potential benefits of unit tests.

Why do we do TDD? We do TDD for one overriding reason and several less important reasons. The less important reasons are:
We spend less time debugging.
The tests act as accurate, precise, and unambiguous documentation at the lowest level of the system.
Writing tests first requires decoupling that other testing strategies do not; and we believe that such decoupling is beneficial.
Those are ancillary benefits of TDD; and they are debatable. There is, however, one benefit that, given certain conditions are met, cannot be debated:
If you have a test suite that you trust so much that you are willing to deploy the system based solely on those tests passing; and if that test suite can be executed in seconds, or minutes, then you can quickly and easily clean the code without fear.

source

So far, I can't disagree

If you could have a test suite which would really give you that level of confidence, and you could produce it via TDD, that would make TDD worthwhile. Such a thing would be a game-changer. It would completely justify his fervor for them.

I assert, though, that for most code, unit tests can't possibly provide that level of certainty against bugs. I wish they could. I would be a unit testing fanatic too... if only they could.

But they can't.

Rather, I find unit tests to be a relatively weak filter for catching bugs. There is a principle of engineering that applies in many disciplines. It's true in mechanics, it's true in electronics, and it's true in software: most problems appear at the connections between units. Consider that trivial Cat class. If there's ever a bug in it, it's not going to be because the constructor somehow doesn't set the age property (it obviously does). It's going to be something like

Oh, it looks like the frontend expects the age to be in months, and this code on the backend assumes it's in years.
There’s a rarely-used function over here which expected the property to be named cat_age
Turns out the database column that stores the age is an integer, and it silently truncates any decimal value we send it.

These are all integration issues.
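The months-versus-years case can be sketched in a few lines. The function names below are hypothetical stand-ins for a "frontend" unit and a "backend" unit; the point is only that each one passes its own unit tests while the combination is wrong.

```python
def age_from_form(form_value):
    """'Frontend' unit: parses the age field, which the form collects in months."""
    return int(form_value)


def is_senior(age):
    """'Backend' unit: assumes the age it receives is in years."""
    return age >= 10


# Each unit passes its own tests in isolation...
def test_age_from_form():
    assert age_from_form("18") == 18


def test_is_senior():
    assert is_senior(12)
    assert not is_senior(2)


# ...yet wiring them together classifies an 18-month-old kitten as senior.
def classify(form_value):
    return is_senior(age_from_form(form_value))
```

Neither unit test can catch this, because the bug lives in neither unit. It lives in the assumption each side makes about the other.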

Most real-world code is less amenable to unit testing

That Cat class has the benefit of being self-contained. For most real code, its whole purpose depends on interactions with other modules, including frameworks, libraries, and databases. In order to pry the units apart and test them in isolation, you need to substitute in a bunch of mocks and fakes for all their dependencies. This makes for less valuable tests, because

a) You’re testing how it interacts with the mock, not how it interacts with the actual dependency

b) When you test interactions with mocks, they end up looking something like this.

Here’s a new method on that `Cat` class.

And here's what a test would look like.
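A minimal sketch of both, using `unittest.mock` from the standard library. `Cat.play` and `toy_mouse` come from the text; the `chase` method on the toy is an assumption made up for illustration.

```python
from unittest.mock import Mock


class Cat:
    def __init__(self, age):
        self.age = age

    def play(self, toy_mouse):
        # The method's entire behaviour is a call on its dependency.
        toy_mouse.chase()


def test_cat_play():
    toy_mouse = Mock()
    cat = Cat(age=3)

    cat.play(toy_mouse)

    # Restates the implementation: play() calls chase(), so assert chase() was called.
    toy_mouse.chase.assert_called_once()
```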

It tests that the Cat.play method invokes the expected method on the toy_mouse… but it doesn’t give any evidence that this was the correct thing for it to do. It doesn’t verify any results from running the play method; rather, it only tests that the play method does what it does.

On the other hand, high unit test coverage comes with a high cost

Size

You triple the amount of code you have to write and maintain. This seems to be a pretty typical experience of unit tested code: That a thorough testing harness is 2-3 times as long as the actual software. All that code incurs an ongoing maintenance cost. It needs to be understood, refactored, and updated as requirements change, just like everything else. Remember, size is the enemy of code. Size is to a codebase as weight is to an airplane. The more size you add, the less nimble the code becomes, the less capable it becomes of adapting rapidly to do new things.

Brittle Tests

Look what happens if I change the Cat class to take a birthdate in the constructor, from which we derive the cat's age. Not only does the test_cat_age test break, every test which needs to be set up by creating a Cat will break as well. Tests rely on the interface of whatever they're testing being stable, and at this fine-grained level, interfaces are typically not stable. Shifting a few properties from one class to another where it fits a little better, adding and removing parameters from method signatures, these are good and normal parts of refactoring. In unit testing an actual system, a small change that rearranges the public interface of a class can cascade through your test suite, leading to dozens of broken tests. This makes small improvements costly enough that they're a little less likely to happen. Bad decisions made early in coding become baked into the design.
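To make that concrete, here is a rough sketch of the change. The `birthdate` parameter comes from the text; the `age_on` method name and the dates are assumptions for illustration, and the stale test is deliberately not named `test_...` so a test runner won't pick it up.

```python
from datetime import date


class Cat:
    # Previously: def __init__(self, age). The constructor now takes a birthdate.
    def __init__(self, birthdate):
        self.birthdate = birthdate

    def age_on(self, day):
        # Whole years elapsed, minus one if this year's birthday hasn't come yet.
        had_birthday = (day.month, day.day) >= (self.birthdate.month, self.birthdate.day)
        return day.year - self.birthdate.year - (0 if had_birthday else 1)


def stale_test_cat_age():
    # The old test_cat_age, unchanged. It now fails with AttributeError,
    # even though nothing about the cat's behaviour is actually wrong.
    cat = Cat(3)
    assert cat.age == 3


def test_cat_age_updated():
    # The same check, rewritten for the new interface.
    cat = Cat(date(2020, 5, 1))
    assert cat.age_on(date(2024, 10, 28)) == 4
```

Every other test that builds a `Cat(3)` in its setup needs the same rewrite, which is where the cascade comes from.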

I remember reading one blog post (uhhhh can't find it now) where the author bemoaned being the only developer in his company who cared about unit testing. He talked about how he spent long months trying to add test coverage to every class he modified, finally started to see the coverage metrics tick up towards something acceptable, and then he would go on vacation for a week and find 400 broken tests when he came back. Little readjustments to interfaces that his colleagues had made caused cascading breaks to his test suite. The blogger narrated this story with a sense of

I read that story and thought "If it was possible for a week's worth of development to break 400 tests, those were bad tests".

False Positives

Did his colleagues actually introduce 400 new bugs in the week he was gone? More likely, it was a matter of false positives. Because of the cascading effects of small interface changes, unit tests are prone to them. A failing unit test tends to indicate not that the code broke, but merely that something changed.

I once lived in a county that would blow their tornado sirens every time the National Weather Service issued a thunderstorm warning. I'm sure there was some bureaucrat who decided "Better safe than sorry! Every thunderstorm could potentially contain a tornado, and should be treated with the same caution". The result of this, of course, was that the county had no tornado warning sirens. They had "It's raining!" sirens.

A test with a high rate of false positives approaches being equivalent to no test at all.

Studies

It's very, very difficult to do an objective experiment on software quality, but what data we have seems to more-or-less point to a similar conclusion: The meta-analysis from Code Complete estimated that unit tests capture about 30% of defects. This study of a Microsoft team showed unit testing decreased defect count by 20.9%, while increasing development time by 30%. There's quite a bit of variation, though: some studies show as high as 60% improvement, while other studies show no improvement at all.

My own anecdotal experience is about in line with the Microsoft study; that unit tests capture maybe 20% of defects, and those concentrated in specific kinds of code that really benefit from them[2]. That points to them being a useful tool, but very context dependent whether they're worth the cost (medical device code, absolutely; fast-moving consumer products startup, maybe not). But in none of these cases do they provide a ‘test suite that you trust so much that you are willing to deploy the system based solely on those tests passing’.

But but but.... Martin Fowler! Kent Beck! Michael Feathers!

It's not just Uncle Bob. These are some of the most respected authors in our field, and they've all thrown their weight behind Unit Testing. How could they all be wrong, and how dare I set myself against them?

A while ago, David Heinemeier Hansson (creator of the Rails framework) published an article on Test Induced Design Damage. This article led to a series of online debates on the merits of unit testing.

It's the best sort of debate. It's genuinely respectful; 3 professionals sincerely trying to understand each other's perspective. While listening to it, I had an intuition. They're not talking about the same kind of code.

In this debate, DHH seems to be viewing "Most Code" as, well, the kind of code I usually write for my jobs. It's products intended for customer use.

Much of the interesting technical work is on the frontend, crafting and massaging the user experience.
The backend code is mostly CRUD operations: relaying data from the frontend to a database, and back again. It's more about bookkeeping than algorithms. It has exciting challenges, but they have more to do with building an elegant data model, and making it robust at large scale, than with the computations it performs per se.
The core success criteria are hard to define. In the end, it's a customer who swipes his credit card with a delighted smile.

This makes sense! DHH runs BaseCamp, a successful consumer products company.

Martin Fowler and Kent Beck, on the other hand, seem to consider "Most Code" to be things like developer tools, search algorithms, libraries and infrastructure. This is code where:

By and large, the value it provides is in interesting computation.
They are meant to be consumed by other software, rather than human users.
Because of this, their interface is precisely defined, and rarely changes. When it does change, it needs to be careful to honor its contracts with existing consumers.

Even I will agree, this kind of code benefits tremendously from unit tests. But it isn't the kind of thing that most of us work on most of the time. Most projects of this kind are open source. They are so satisfying to create, that developers will work on them for no pay at all, simply for the love of the craft. It's a rare, sought-after job that lets you get paid to write this kind of code full-time.

I think what's happened is that Fowler and Beck, since they became successful authors, have had the privilege of picking and choosing their projects, and they gravitate towards the most technically interesting projects.

An Example

That doesn't sound right! I know that Cat class from before is pretty trivial, but even basic CRUD apps benefit from unit tests.

I'll show you what I mean. Let's start by looking at a typical sort of frontend component I would write (React).

To unit test this, you would have to separate it somehow from the React framework. That's not really practical - the whole point is to produce a React object, filled with React-specific properties. You could have a test which burrows deep into the object and asserts that somewhere in there is an object representing an element, with a value equivalent to value.firstName. But what would be the point? It doesn't test whether the form behaves correctly.

I guess, for components which just render HTML. But most interesting components have more logic.

Consider this parent component of the ContactForm.

This code is almost pure logic, but again, I find that while it’s possible to perform unit tests, they wouldn't really add value. Take the formStateSetter. There might be a way to pull it out and test it in isolation. But most of what it does is take a framework object, and pass it to a framework method. You'd have to create a mock object and assert that it gets passed to a mock method. That wouldn't test the thing which is important about that formStateSetter, which is, whether it gets called at the right time, and whether it sets the correct framework state. It's a meager test, and making it would damage the code by introducing a whole unnecessary layer of abstraction.[3]

There's one bit of logic here which is genuinely unit testable. You could test that one switch statement, by initializing the component and asserting that it returns an EmployeeContactForm component when you pass in the contactType === 'coWorker', and a ContactForm component if contactType === 'externalContact'. But notice also, that you could rewrite this component as

Which reduces that test to triviality. An interesting point to note here: Making the code declarative, rather than procedural, tends to make it better code - more succinct, fewer branching paths - in a way that makes unit tests unnecessary.

How about the backend code?

The endpoint this form would submit to might look something like

You could write a unit test for this method by injecting mocks for NewContactValidator, Session, request, and jsonify, and asserting that it calls them. But again, if you do that, you're not really proving that it produces correct results; you're only proving that it interacts with the mocks.

Maybe unit testing controllers doesn’t always make sense. But that validator! That certainly contains some unit-testable logic.

Fair enough! That could be implemented like
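A minimal sketch of a hand-rolled version, in plain Python. `NewContactValidator` is named in the text; the `validate` method, the field names, the error messages, and the deliberately loose email pattern are all assumptions for illustration.

```python
import re

# Deliberately loose pattern; real-world email validation is more involved.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


class NewContactValidator:
    """Hand-rolled validation: plain data in, list of error messages out."""

    def validate(self, data):
        errors = []
        first_name = data.get("firstName", "")
        if not isinstance(first_name, str) or not first_name.strip():
            errors.append("firstName is required")
        email = data.get("email", "")
        if not isinstance(email, str) or not EMAIL_PATTERN.match(email):
            errors.append("email is not a valid address")
        return errors


def test_valid_contact_has_no_errors():
    data = {"firstName": "Bart", "email": "bsimpson@hotmail.com"}
    assert NewContactValidator().validate(data) == []


def test_blank_name_and_bad_email_are_reported():
    data = {"firstName": " ", "email": "nope"}
    assert NewContactValidator().validate(data) == [
        "firstName is required",
        "email is not a valid address",
    ]
```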

There! That should absolutely be unit tested.

Indeed! But notice, there's a better way to do it: Use a validation library.[4]

Once again, it’s now so declarative that unit tests have become trivial. The very fact that this was such a crisp, data-in-data-out operation, the fact that its output was so cleanly verifiable, was a hint that it was better handled by a library.

This is what I mean when I claim that unit tests are a code smell.

I mean it in the same sense that comments are a code smell. It's not that all comments are bad! It's that their presence hints that there might be room for improvement. If they're not necessary to understand the code, they are dead weight, and if they are necessary, they hint that the code they're attached to is difficult to understand. The comment is a second-best, and the ideal would be to refactor it such that it is no longer difficult to understand.

In the same way, if the unit tests are brittle or trivial they are dead weight, and if they prove something interesting, they hint that it might be functionality better handled in a library. They smell of "not invented here" syndrome, of a bored senior engineer who thought it would be an interesting challenge to roll his own validation library.

I can't believe it's 2024 and you're still arguing this. Isn't thorough unit test coverage an established best practice at big companies?

At least at the ones I've been in, yeah. But they don't do much to reduce bugs, nor does anyone expect them to. For most developers at those companies, writing unit tests was just one more of the small indignities you have to put up with in a large corporation, like agile workshops and droning compliance training videos. They were part and parcel of a culture that emphasized trudging through the process, rather than innovating. They were at best a minor augment to manual testing, not a replacement. They certainly never fulfilled Uncle Bob's promise, that they would allow you to deploy to production based solely on their passing.

So what now? Are you recommending we just hire a legion of testers, and go through a long testing phase before deploying? Or give up, and push potentially unstable code to production, consequences be damned?

Far from it!

Uncle Bob was wrong when he promised us that unit testing could give a clear line in the sand between pure (unit tested) and impure (not unit tested) code. His promise that the extra effort of unit testing would pay off by producing code that you could deploy to production without fear, was an empty promise.

But there's a kernel of truth there. What if, somehow, there was a technique for automated testing which was thorough enough, that you could just push new code to production without fear? I've certainly seen the fear of introducing bugs cripple entire software groups. I've seen it often enough, I've taken to calling it The Monster that Eats Software Companies. If there was a way to slay that monster once and for all, even if it were costly, it would still be worthwhile.

Defense in Depth

I want to say up front, that this approach is nothing new, and nothing unique to me. DHH and Kent Dodds both have articles making basically the same point. I got the "Defense in Depth" name from Hillel Wayne. And I have noticed, especially in the Python/Ruby worlds, that a lot of the culture and tooling is more about writing integration tests, though they sometimes get called "Unit Tests" for the street cred.

In the famous software tome Code Complete, Steve McConnell presented a summary of all the best defect reduction techniques. What he found was that there was no silver bullet - no surefire way to prevent bugs. Rather, each technique could be expected to catch 20-30% of software defects.

But if we abandon the idea of clear division between pure and impure code, we can still get something useful - more tested code and less tested code.

Think of each of these testing techniques not as a dividing line between 'safe' and 'unsafe'. Think of it as a sieve, which can be counted on to catch ~25% of the remaining bugs.

When viewed that way, no one sieve is going to provide certainty against bugs. But, imagine what happens if you layer 10 different sieves on top of each other.

Static type checking and static analysis
Unit tests around the pure algorithmic code where they're really useful
Integration tests around frontend components
Integration tests around backend endpoints
End-to-end tests verifying the key paths through the application
Containerized deployments to ensure consistency between environments
Feature flags to release new features to only a limited audience, and enable A/B testing
Load testing, error testing, penetration testing, and security scans, to test the code's behavior under stress
Detailed telemetry and usage metrics
A small corps of professional testers doing their ingenious best to find exceptional cases

If you catch 25% of the remaining bugs in each layer, all 10 layers together will catch roughly 94% of the defects in your system (1 - 0.75^10 ≈ 0.944). It's even better than that: the remaining 6% or so of bugs, the ones which have somehow escaped notice from all these layers of testing, are likely to be the kinds of bugs that escape the notice of your customers too.
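The arithmetic behind that figure is short enough to check directly. It assumes, as the sieve picture does, that every layer independently catches a quarter of whatever reaches it, which is a simplification.

```python
layers = 10
catch_rate = 0.25  # each layer catches a quarter of whatever reaches it

escaped = (1 - catch_rate) ** layers  # fraction that slips through every layer
caught = 1 - escaped

print(f"caught: {caught:.1%}, escaped: {escaped:.1%}")
# prints "caught: 94.4%, escaped: 5.6%"
```

No single layer does better than 25%, yet the stack as a whole lets only about one bug in eighteen through.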

When we view it this way, it no longer makes sense to have draconian rules like "Every line must be covered by tests." Rather, each layer of tests is subject to the 80/20 rule. The most common-sense 20% of possible unit tests will give 80% of the results you'll ever see from unit testing. It's a better use of time to stop there and add a layer of integration testing on top, than to 5x your effort to max out unit testing.

Static type checking

I've come around full circle on this one. I was an apostate from statically-typed languages like Java and C# to dynamically typed ones like Python and Ruby. Typescript and Go brought me back into the static-typing fold.

My experience with C# was that the certainty of static type checking was nice. But somehow, any given task seemed to take twice as much code to accomplish in C# when compared to Python. And while static typing would catch an occasional bug, the brevity and readability of Python more than made up for it.

But then I found Typescript - a superset of Javascript that adds static type checking. It made the code a tiny bit more verbose.... but it was a 10% increase, not a 100% increase. And it made an entire category of bugs simply disappear.

Once I'd gotten used to Typescript, I also realized that, for all its verbosity, the static typing in C# wasn't even very good static typing, because everything could potentially be null! You would still have to step through your code thinking about all the things that could possibly be null.

Unit Tests

For all my complaining about them, I don't want to dispense with unit tests entirely! Even in our modern world with our rich ecosystem of libraries on Github, you still sometimes need to write non-trivial algorithms. These absolutely benefit from unit tests. The most effective unit tests also happily turn out to be the ones that are easiest to set up - the ones that require no mocks. If you are writing your own Lisp parser, it makes sense to have tests like
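A rough sketch of what such tests could look like, together with a toy parser small enough to run them against. The `Parser` class, its `parse` method, and the choice of nested Python lists as output are assumptions for illustration, not a real Lisp implementation.

```python
class Parser:
    """Toy s-expression parser: text in, nested lists of ints and symbols out."""

    def parse(self, source):
        tokens = source.replace("(", " ( ").replace(")", " ) ").split()
        expression, rest = self._read(tokens)
        if rest:
            raise SyntaxError("unexpected trailing input")
        return expression

    def _read(self, tokens):
        if not tokens:
            raise SyntaxError("unexpected end of input")
        head, rest = tokens[0], tokens[1:]
        if head == "(":
            items = []
            while rest and rest[0] != ")":
                item, rest = self._read(rest)
                items.append(item)
            if not rest:
                raise SyntaxError("missing closing parenthesis")
            return items, rest[1:]
        if head == ")":
            raise SyntaxError("unexpected closing parenthesis")
        try:
            return int(head), rest
        except ValueError:
            return head, rest  # anything that is not an integer is a symbol


# The tests only touch parse(): no mocks, no setup, just data in and data out.
def test_parses_flat_expression():
    assert Parser().parse("(+ 1 2)") == ["+", 1, 2]


def test_parses_nested_expression():
    assert Parser().parse("(* 2 (+ 3 4))") == ["*", 2, ["+", 3, 4]]


def test_rejects_unbalanced_parens():
    try:
        Parser().parse("(+ 1 2")
    except SyntaxError:
        return
    raise AssertionError("expected SyntaxError")
```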

Even here, though, make note. It's not digging in and testing the smallest possible units within the Parser in isolation. Rather, it's testing the Parser at its public-facing interface. It's testing the behavior that external modules will rely on.

Integration testing

Remember that 'NewContactForm' component that I used as an example of something which couldn't be unit tested effectively?

Here's what an integration test would look like for the same code.

There is no attempt here to test this component in isolation. It lives within its own native environment of mouse clicks, keyboard interactions, and the React framework. Again, it draws a boundary at the public-facing interface for the component - inputs including mouse/keyboard interactions, and outputs including http requests. It tests the component in a manner which is as close as reasonably possible to how it is actually used.

I dislike mocks in general, but I find it makes sense to make an exception for mocking outgoing api calls.

Backend integration tests

Rather than trying to pry apart the individual pieces of that endpoint and test them in isolation, you can do something like
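The sketch below shows the shape of such a test using only the Python standard library: a tiny WSGI app stands in for the real endpoint and an in-memory SQLite database stands in for the test database. The route, payload fields, table schema, and helper names are all assumptions for illustration; with a real web framework you would use its own test client instead of the hand-built `post_json` helper.

```python
import io
import json
import sqlite3
from wsgiref.util import setup_testing_defaults


def make_app(db):
    """Build a tiny WSGI app with one endpoint: POST /contacts."""

    def app(environ, start_response):
        json_header = [("Content-Type", "application/json")]
        if environ["REQUEST_METHOD"] != "POST" or environ["PATH_INFO"] != "/contacts":
            start_response("404 Not Found", json_header)
            return [b'{"error": "not found"}']
        length = int(environ.get("CONTENT_LENGTH") or 0)
        data = json.loads(environ["wsgi.input"].read(length) or b"{}")
        if not data.get("firstName") or "@" not in data.get("email", ""):
            start_response("400 Bad Request", json_header)
            return [b'{"error": "invalid contact"}']
        cursor = db.execute(
            "INSERT INTO contacts (first_name, email) VALUES (?, ?)",
            (data["firstName"], data["email"]),
        )
        db.commit()
        start_response("201 Created", json_header)
        return [json.dumps({"id": cursor.lastrowid}).encode()]

    return app


def post_json(app, path, payload):
    """Call the app the way a WSGI server would; return (status, parsed body)."""
    body = json.dumps(payload).encode()
    environ = {}
    setup_testing_defaults(environ)
    environ.update(REQUEST_METHOD="POST", PATH_INFO=path, CONTENT_LENGTH=str(len(body)))
    environ["wsgi.input"] = io.BytesIO(body)
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    response = b"".join(app(environ, start_response))
    return captured["status"], json.loads(response)


def test_create_contact_saves_and_returns_id():
    db = sqlite3.connect(":memory:")  # stands in for the test database
    db.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, first_name TEXT, email TEXT)")
    app = make_app(db)

    status, body = post_json(
        app, "/contacts", {"firstName": "Bart", "email": "bsimpson@hotmail.com"}
    )

    assert status == "201 Created"
    rows = db.execute("SELECT id, first_name, email FROM contacts").fetchall()
    assert rows == [(body["id"], "Bart", "bsimpson@hotmail.com")]
```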

This test is doing something remarkable. It's exercising the entire endpoint. When you run the test, that endpoint receives an http request, validates it, saves it to a test database, and returns an http response. Not only is this simpler than prying apart the individual pieces, but it's a better test. Like with the frontend integration test, the test isn't all that far removed from being a plain-English description of the behavior we expect from the endpoint.

In both cases, they are un-brittle tests. They're not subject to breaking because some under-the-hood detail changed. They describe the behavior they're expecting, and they're not really going to break unless the component or the endpoint no longer behaves that way.

End-to-End tests

If you travel back in time a couple of decades, the role of tester would involve executing a test script. This would spell out the exact procedures for testing a piece of software in excruciating detail - things like

Click on the "First Name" textbox
Type the name "Bart"
Press the tab key
Type the email "bsimpson@hotmail.com"

It's not a great use of a skilled tester's time. It fails to make use of his ingenuity, and instead it asks him to be as much like a robot as possible. These kinds of test plans, though, are low-hanging fruit for automation. They're a useful augment to integration tests, as they can test workflows through an entire app, including integrations with external services like email.

Compared to integration tests, these are slow and brittle. They have more moving parts, and they're prone to breaking because of small changes in the UI, or because a page loaded too slowly. The sequences of interactions a user can possibly have with an app quickly fan out to infinity, making them even less practical than other test types for covering every possible scenario. But they can give confidence that entire paths through an app remain intact. An end-to-end test can prove that after you create a contact, if you start composing an email that contact will come up among the suggested recipients. It's an important workflow, and one that's hard to confirm by testing individual endpoints and components.

Containerized Deployments

Most of the anxiety that breaks the momentum of small companies isn't about bugs that were somehow missed after reasonably conscientious testing. It's about those bugs which show up nowhere save for the production environment. Container technology both makes these bugs less likely, by ensuring consistency between the testing and production environments, and makes a quick rollback easy in the event something goes wrong anyway.

Production Metrics and Staged Rollout

When I read about the testing practices at the big players (Google, Meta), I'm surprised by what they emphasize. They do this whole defense-in-depth thing, of course, but I'm struck that what they rely on most seems to be a form of testing that most smaller companies don't use at all. It's that they very carefully collect all sorts of metrics from their apps. They log every api call, every database transaction, every mouse click. They have realtime dashboards agglomerating these logs into all sorts of metrics of how their software is being used. This ensures that they can rapidly detect any production breakages. When releasing new features, they can deploy them to just a tiny slice of users to start with, and watch the metrics to determine if any areas of usage get noticeably worse or noticeably better. In a way, it's the ultimate confirmation - the proof of the pudding, etc. It catches a class of issues which the other layers of testing wouldn't even know how to look for: Changes, maybe intentional changes, that don't break the app, they just make it a little less useful, so a higher percentage of users give up and leave.

Ambiguity

Moral certitude around testing is a problem for software quality. It's hard to improve a system which is ossified under a giant suite of bad tests, and it’s harder yet when your culture treats a suggestion of "Maybe this module doesn't need unit tests" as the ethical equivalent of "Maybe we should embezzle from the company". You can't have a discourse on how to test most effectively, when the only acceptable answer is "More unit tests".

The reality is so much messier. Absolute certainty is out of reach, and what we get instead is layers on layers of imperfect answers. But, with a thoughtful, pragmatic approach, we can still create software of extremely high quality.

[1] I also hear them called “Unit Tests” on occasion, though that doesn’t seem to be accurate. I suspect that when people do this, they have quietly come to the same conclusion I have, but don't want to get called "unprofessional" by Uncle Bob for not writing "unit tests".

[2] This might also explain the huge variance among studies - that the benefits of unit tests are highly concentrated in certain kinds of code.

[3] There are certainly people out there who would argue “And this is precisely why you shouldn’t use a framework like React - it makes it hard to do unit testing.” To me, this seems like the epitome of Test-Induced Design Damage.

[4] A subtle, gradual change in how we develop over the last 20 years: you didn't use to have nearly as dependable a set of libraries. You used to build a lot more of this sort of thing yourself. I suspect part of the devotion you see to unit testing is a holdover from that era.

Axten Software
