Define Quality Before You Measure It

At the end of every semester, students rate their teachers. The teacher who made class enjoyable, kept things light, and gave generous grades consistently scores well. The teacher who assigned difficult work, gave honest feedback, and held students to a high standard often scores lower, sometimes much lower. Both scores go into the same system, labeled “teacher performance.” If you run a school on those numbers long enough, you will have very popular teachers. Whether you will have very good ones depends on a question nobody asked before the ratings started: what does a high-performing teacher actually look like?

That question is harder than it sounds. And if you don’t answer it before you build the measurement, the measurement will answer it for you.

The distinction worth holding is between experience and quality. Experience is what something felt like: Was it enjoyable? Was it easy? Did I get what I came for? Quality is what actually happened: Did I learn something? Did I develop a skill? Did it make me better in a way I can point to? These two things often align. They also diverge. And when they diverge, optimizing for the measurable one while calling it the important one is how you end up selecting for the wrong thing.

Two axes help locate where any given metric actually sits. The first is observability: what can the person rating something actually assess? A student can assess whether a teacher was clear, patient, and fair. A student cannot easily assess whether the curriculum was rigorous, whether the teaching approach built durable understanding rather than exam performance, or whether the teacher’s standards prepared them for what came next. Ratings are only valid for the first category. For the second, they’re noise, sometimes actively misleading.

The second axis is incentive direction: if someone optimizes for this metric, do they produce more of what you actually want? A teacher who optimizes for student satisfaction grades easier, assigns less demanding work, and avoids the discomfort of honest feedback. That behavior will show up in the ratings as improvement. Whether it reflects better teaching is a different question entirely.

The metric you choose is the behavior you reward. Measure ratings, and over time you will have people who are very good at generating ratings. Whether that overlaps with what you actually wanted depends entirely on how carefully you defined the target before you started measuring.

Satisfaction captures a real but narrow slice of quality. Whether someone communicated clearly, whether they were patient, whether they explained things in a way you could follow: these are things the recipient is genuinely qualified to assess. A measurement system designed around that narrow slice is useful and honest. One framed as measuring “performance” without drawing that line will be used to draw conclusions it cannot support.

The easiest version of the job often scores best. A teacher who never challenges students, a consultant who tells clients what they want to hear, a doctor who prescribes whatever the patient requests: all of these are optimized for satisfaction. None of them is doing the job well. If your metric cannot distinguish between these and their rigorous counterparts, you have a popularity contest, not a performance system.

The hardest part of any job is rarely visible to the person on the receiving end. Knowing when to push back, when to escalate, when the right answer is “no” or “not yet” or “you need someone else”: this judgment is the most important quality attribute in almost any knowledge role, and the hardest to see from the outside. Someone who exercises it consistently may frustrate people in the short term. If your metric penalizes that, you have built something that selects against the judgment you need most.

Quality attributes must be defined before you design measurement, not after. A rating system is an instrument. Like any instrument, it measures what it is calibrated to measure. If you build it before deciding what high performance actually looks like, you haven’t avoided the definition problem. You have let the instrument make the decision by default, and you will only discover what it chose when the behavior starts to change.

Before building any measurement system, answer these questions. What specifically are we trying to learn that we cannot learn another way? Who is actually qualified to assess that thing? What decision will we make differently based on what this tells us? And if the measurement says one thing while the actual outcomes say another, which one wins?

The edge case worth naming: satisfaction data is not useless. It surfaces real problems that other signals miss entirely. A teacher who is technically skilled but dismissive, who humiliates students, who communicates in ways that leave people more confused than when they started: these failures won’t show up in exam results. Satisfaction catches them. The design question is not whether to collect it but what claims you allow it to support.

Build the measurement. But first write down what high performance actually means, in terms specific enough that two people reading your definition would make the same call. Otherwise you will have a dashboard, a number, and no idea what you are actually looking at.