Bad Metrics are Worse than No Metrics
Modern industry is permeated by a constant need to measure, whether we’re testing performance optimisations, or gathering data for performance reviews. Not all things are easy to measure directly though, which leaves us reaching for proxy metrics: correlations that are easier to track than the impact we actually care about. The problem is that once you start measuring something, people start optimising for the measurement itself… and that doesn’t always lead to the impact you were seeking.
In fact, sometimes the simple act of trying to measure a behaviour we want can be so destructive that it would be better not to measure it at all. To illustrate this, we’re going to talk about my hypothetical tiling company, MegaTilers, the fastest tilers in the trade. Or so we claim.
Tiling Speed
So how can we justify this vaunted claim? It’s simple really: we measure the number of tiles used per week!
Choosing a Metric
How did we land on this metric? We needed a way to measure the productivity of our tilers. We started by looking at the number of jobs completed, but we quickly realised this wasn’t appropriate - some jobs involve tiling multiple rooms, while others are very quick. As a company, we generally make a better profit on larger jobs, so we didn’t want to choose a metric where better numbers meant focusing on less profitable jobs.
“Tiles Used” worked great, though. It’s not perfect - what metric is? - but it scales nicely with larger jobs, and it’s really easy to measure, because employees have to order any tiles they use from our central warehouse. Now all we had to do was monitor whether the number of tiles being ordered - and not returned at the end of a job - was increasing over time.
Measurable Impact
At first, everything was fine. Then we went through a performance review period where employees first heard about this new metric. After that, everything went a bit weird.
The first thing we noticed was a dramatic uptick in the metric. Huge improvements, almost overnight! More tiles than ever were being used. One tiler in particular doubled their tile usage in a month, quickly becoming the biggest tile consumer in the company! We gave them a bonus and featured their metric success in the company all-hands. After that everyone suddenly started using more tiles, obviously following our Star Employee’s example!
But at the end of the year, we discovered that - financially - we weren’t actually any better off. In fact, things were worse! We were completing slightly more jobs than previously, but our materials cost had gone through the roof; we were using triple the number of tiles we had previously, but we hadn’t tripled the revenue accordingly. Immediately, we initiated a review.
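To make the arithmetic concrete, here is a minimal sketch with entirely made-up figures. The job counts, revenue per job and tile price are assumptions for illustration, not real MegaTilers numbers, and labour and other costs are ignored to keep it short.

```python
from dataclasses import dataclass


@dataclass
class Year:
    jobs: int               # jobs completed in the year
    revenue_per_job: float  # hypothetical average revenue per job
    tiles_used: int         # the proxy metric being rewarded
    cost_per_tile: float    # hypothetical materials cost per tile

    @property
    def revenue(self) -> float:
        return self.jobs * self.revenue_per_job

    @property
    def materials_cost(self) -> float:
        return self.tiles_used * self.cost_per_tile

    @property
    def profit(self) -> float:
        # Simplification: materials are the only cost modelled here.
        return self.revenue - self.materials_cost


# All numbers below are invented for illustration.
before = Year(jobs=1000, revenue_per_job=500.0, tiles_used=100_000, cost_per_tile=1.0)
after = Year(jobs=1100, revenue_per_job=500.0, tiles_used=300_000, cost_per_tile=1.0)

print(f"Tiles used: x{after.tiles_used / before.tiles_used:.1f}")
print(f"Revenue:    x{after.revenue / before.revenue:.1f}")
print(f"Profit:     {before.profit:,.0f} -> {after.profit:,.0f}")
```

With these made-up numbers, tile usage triples and revenue rises by 10%, yet profit drops from 400,000 to 250,000. The metric looks fantastic while the thing it was meant to stand in for gets worse.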
Playing the Game
We started by reviewing the most recent work of our biggest tile consumers, and the problem immediately became clear. Knowing that we were measuring them on tiles used, rather than the speed or quality of their work, our employees had started finding “inventive” ways to use more tiles to do the same job. These included:
- Breaking many more tiles than usual when cutting to fit irregular spaces
- Doing a double layer of tiles
- Steering customers to use the smallest tile possible for a space in order to maximise tile count
- “Accidentally” tiling the wrong wall, or using the wrong tiles to begin with
We realised we’d made a terrible mistake: by publicising a proxy metric only loosely correlated with the value we wanted to achieve, we incentivised our employees to game the system and make things far worse in the process. In trying to measure whether we were the best, we had created significant negative consequences for ourselves.
Parallels in Software
This example might seem ridiculous, but this is literally what has been going on in software engineering for years, whether the metric is LOC (lines of code) or, more recently, “Tokenmaxxing”, where engineers are measured on how many AI tokens they use in a month.
It’s extremely easy to bloat either your token usage or your LOC:
| Token Usage Inflator | Impact |
|---|---|
| Use a more “intelligent” model | Significant increase in cost and “thinking” time for models. At best just more expensive, at worst actually slows down development |
| One single long-running context | Significant degradation in model performance, higher likelihood of bugs |
| Use /loop or similar skills to check for things waiting on a human | Basically just a waste of tokens for minimal productivity increase |
| Leave scope unrestricted, so the model has to search the whole codebase for context | Decrease in response speed, increase in context size (leading to the long-running context problem again) |

| Lines of Code Inflator | Impact |
|---|---|
| Unnecessarily verbose code | Harder to read, harder to maintain, more likely to spawn bugs (Increased context size for LLMs and humans alike) |
| Block comments in code | Comments are too specific about implementation, get out of sync with actual code |
| Excessive unit tests | Test suite takes longer to run, tests are too specific and break more often on code change; slows delivery speed |
| Lots of whitespace | Time wasted adding extra lines for the sake of LOC, takes longer to navigate files |
| “Copypasta” - duplicating code inline rather than using shared methods | MUCH harder to maintain, code quality reduced, increased likelihood of behaviour drift/bugs |
All of the above are extremely easy for the author to do. It’s basically no extra effort to significantly inflate how good you look, but unfortunately every one of these “optimisations” comes with negative consequences and adds no tangible value.
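As a rough illustration of how cheap LOC inflation is, the sketch below compares two snippets that do exactly the same thing. The snippets, the `naive_loc` counter and the `run_total` helper are all hypothetical and written for this example; real LOC tooling may count differently.

```python
# Two behaviourally identical implementations, held as source strings
# so that their line counts can be compared directly.
COMPACT = '''
def total(prices):
    return sum(prices)
'''

PADDED = '''
def total(prices):
    # Initialise the running total to zero
    running_total = 0

    # Loop over every price in the list
    for index in range(len(prices)):

        # Fetch the current price
        current_price = prices[index]

        # Add the current price to the running total
        running_total = running_total + current_price

    # Return the running total
    return running_total
'''


def naive_loc(source: str) -> int:
    """Count every line, blanks and comments included, as a crude LOC metric would."""
    return len(source.strip("\n").splitlines())


def run_total(source: str, prices: list[int]) -> int:
    """Execute a snippet and call the `total` function it defines."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["total"](prices)


prices = [3, 4, 5]
for name, src in [("compact", COMPACT), ("padded", PADDED)]:
    print(f"{name}: {naive_loc(src)} lines, total = {run_total(src, prices)}")
```

Both versions return the same total, but the padded one scores several times higher on a naive line count while being slower to read and easier to break.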
Yes, if you take a very productive engineer who is hyper-engaged with agent-based development and churning out lots of valuable features, they will probably have a higher output on both metrics than someone who is coasting… at least, until the coasting engineer discovers they’re being measured on LOC and/or Tokens Used. At that point it’s virtually guaranteed they will start taking steps to improve their metrics, but it’s a lot easier to game a metric than it is to actually improve valuable output. They will not change their behaviour in the way you were hoping!
People will Game Metrics
The reality is that as soon as you start measuring people - and rewarding or punishing them financially for their performance against specific metrics - people will optimise for those metrics. It’s not even something you can really chastise them for. If they don’t optimise for it, then they won’t do as well in performance reviews; in the overwhelming majority of cases, people work to make money. They will do the things that earn them more money.
If you really want to measure something, make sure you choose a metric that you don’t care if people game. Set constraints around the metrics so that they can’t be abused, or pair metrics that counterbalance each other, so that over-optimising for one leads to a negative impact on the other.
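One way to picture a counterbalancing pair, going back to the tiling example: credit the area actually tiled, but only while tile consumption stays close to what that area should need. This is a minimal sketch, and every name and threshold in it (`WeekReport`, `MAX_WASTE_RATIO`, the 15% tolerance) is an assumption made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class WeekReport:
    tiler: str
    tiles_used: int               # the original proxy metric
    area_tiled_m2: float          # hypothetical counter-metric: area covered
    expected_tiles_per_m2: float  # hypothetical: nominal tiles per m2 for the tile size


# Assumed tolerance for cuts and breakages; not a real industry figure.
MAX_WASTE_RATIO = 1.15


def waste_ratio(report: WeekReport) -> float:
    """Tiles used relative to the tiles nominally needed for the area covered."""
    needed = report.area_tiled_m2 * report.expected_tiles_per_m2
    return report.tiles_used / needed


def score(report: WeekReport) -> float:
    """Credit area tiled, but only when waste stays within tolerance."""
    if waste_ratio(report) > MAX_WASTE_RATIO:
        return 0.0
    return report.area_tiled_m2


honest = WeekReport("A", tiles_used=1100, area_tiled_m2=100.0, expected_tiles_per_m2=10.0)
gamer = WeekReport("B", tiles_used=2200, area_tiled_m2=100.0, expected_tiles_per_m2=10.0)

for report in (honest, gamer):
    print(f"{report.tiler}: waste ratio {waste_ratio(report):.2f}, score {score(report):.1f}")
```

Here, doubling tile consumption for the same area wipes out the score instead of boosting it, and choosing smaller tiles no longer helps because credit is given for area rather than tile count. It is still only a sketch: someone could over-report the area tiled, which is exactly why the constraints need as much thought as the metric.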
If you can’t find anything that works, it may well be a case where it’s better not to measure it at all.