202312 - apenwarr

NPS, the good parts

The Net Promoter Score (NPS) is a statistically questionable way to turn a set of 10-point ratings into a single number you can compare with other NPSes. That's not the good part.

Humans

To understand the good parts, we have to start with humans. Humans have emotions, and those emotions are mostly what they use when asked to rate things on a 10-point scale.

Almost exactly twenty years ago, I wrote about sitting on a plane next to a musician who told me about music album reviews. The worst rating an artist can receive, he said, is a lukewarm one. If people think your music is neutral, it means you didn't make them feel anything at all. You failed. Someone might buy music that reviewers hate, or buy music that people love, but they aren't really that interested in music that is just kinda meh. They listen to music because they want to feel something.

(At the time I contrasted that with tech reviews in computer magazines (remember those?), and how negative ratings were the worst thing for a tech product, so magazines never produced them, lest they get fewer free samples. All these years later, journalism is dead but we're still debating the ethics of game companies sponsoring Twitch streams. You can bet there's no sponsored game that gets an actively negative review during 5+ hours of gameplay and still gets more money from that sponsor. If artists just want you to feel something, but no vendor will pay for a game review that says it sucks, I wonder what that says about video game companies and art?)

Anyway, when you ask regular humans, who are not being sponsored, to rate things on a 10-point scale, they will rate based on their emotions. Most of the ratings will be just kinda meh, because most products are, if we're honest, just kinda meh. I go through most of my days using a variety of products and services that do not, on any more than the rarest basis, elicit any emotion at all. Mostly I don't notice those. I notice when I have experiences that are surprisingly good, or (less surprisingly but still notably) bad. Or, I notice when one of the services in any of those three categories asks me to rate them on a 10-point scale.

The moment

The moment when they ask me is important. Many products and services are just kinda invisibly meh, most of the time, so perhaps I'd give them a meh rating. But if my bluetooth headphones are currently failing to connect, or I just had to use an airline's online international check-in system and it once again rejected my passport for no reason, then maybe my score will be extra low. Or if Apple releases a new laptop that finally brings back a non-sucky keyboard after making laptops with sucky keyboards for literally years because of some obscure internal political battle, maybe I'll give a high rating for a while.

If you're a person who likes manipulating ratings, you'll figure out what moments are best for asking for the rating you want. But let's assume you're above that sort of thing, because that's not one of the good parts.

The calibration

Just now I said that if I'm using an invisible meh product or service, I would rate it with a meh rating. But that's not true in real life, because even though I was having no emotion about, say, Google Meet during a call, perhaps when they ask me (after every...single...call) how it was, that makes me feel an emotion after all. Maybe that emotion is "leave me alone, you ask me this way too often." Or maybe I've learned that if I pick anything other than five stars, I get a clicky multi-tab questionnaire that I don't have time to answer, so I almost always pick five stars unless the experience was so bad that I feel it's worth an extra minute because I simply need to tell the unresponsive and uncaring machine how I really feel.

Google Meet never gets a meh rating. It's designed not to. In Google Meet, meh gets five stars.

Or maybe I bought something from Amazon and it came with a thank-you card begging for a 5-star rating (this happens). Or a restaurant offers free stuff if I leave a 5-star rating and prove it (this happens). Or I ride in an Uber and there's a sign on the back seat talking about how they really need a 5-star rating because this job is essential so they can support their family and too many 4-star ratings get them disqualified (this happens, though apparently not at UberEats). Okay. As one of my high school teachers, Physics I think, once said, "A's don't cost me anything. What grade do you want?" (He was that kind of teacher. I learned a lot.)

I'm not a professional reviewer. Almost nobody you ask is a professional reviewer. Most people don't actually care; they have no basis for comparison; just about anything will influence their score. They will not feel bad about this. They're just trying to exit your stupid popup interruption as quickly as possible, and half the time they would have mashed the X button instead but you hid it, so they mashed this one instead. People's answers will be... untrustworthy at best.

That's not the good part.

And yet

And yet. As in so many things, randomness tends to average out, probably into a Gaussian distribution, says the Central Limit Theorem.

The Central Limit Theorem is the fun-destroying reason that you can't just average 10-point ratings or star ratings and get something useful: most scores are meh, a few are extra bad, a few are extra good, and the next thing you know, every Uber driver is a 4.997. Or you can ship a bobcat one in 30 times and still get 97% positive feedback.

There's some deep truth hidden in NPS calculations: that meh ratings mean nothing, that the frequency of strong emotions matters a lot, and that deliriously happy moments don't average out disastrous ones.
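To make that concrete, here is a small sketch with two invented sets of ratings. The numbers are made up purely for illustration and come from no real survey. Both sets have the same average, but one is all meh and the other has a quarter of its customers furious.

```python
from statistics import mean

# Hypothetical ratings on a 1-to-10 scale, invented for illustration only.
all_meh = [7, 8] * 10                        # twenty people who felt nothing much
polarized = [10] * 10 + [2] * 5 + [8] * 5    # ten delighted, five furious, five meh

def count_angry(ratings):
    """Count ratings of 6 or below, the 'very angry' range in this post."""
    return sum(1 for r in ratings if r <= 6)

print(mean(all_meh), mean(polarized))                # 7.5 7.5
print(count_angry(all_meh), count_angry(polarized))  # 0 5
```

The average can't tell these two apart. A count of the outliers can.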

Deming might call this the continuous region and the "special causes" (outliers). NPS is all about counting outliers, and averages don't work on outliers.

The degrees of meh

Just kidding, there are no degrees of meh. If you're not feeling anything, you're just not. You're not feeling more nothing, or less nothing.

One of my friends used to say, on a scale of 6 to 9, how good is this? It was a joke about how nobody ever gives a score less than 6 out of 10, and nothing ever deserves a 10. It was one of those jokes that was never funny because they always had to explain it. But they seemed to enjoy explaining it, and after hearing the explanation the first several times, that part was kinda funny. Anyway, if you took the 6-to-9 instructions seriously, you'd end up rating almost everything between 7 and 8, just to save room for something unimaginably bad or unimaginably good, just like you did with 1-to-10, so it didn't help at all.

And so, the NPS people say, rather than changing the scale, let's just define meaningful regions in the existing scale. Only very angry people use scores like 1-6. Only very happy people use scores like 9 or 10. And if you're not one of those you're meh. It doesn't matter how meh. And in fact, it doesn't matter much whether you're "5 angry" or "1 angry"; that says more about your internal rating system than about the degree of what you experienced. Similarly with 9 vs 10; it seems like you're quite happy. Let's not split hairs.

So with NPS we take a 10-point scale and turn it into a 3-point scale. The exact opposite of my old friend: you know people misuse the 10-point scale, but instead of giving them a new 3-point scale to misuse, you just postprocess the 10-point scale to clean it up. And now we have a 3-point scale with 3 meaningful points. That's a good part.
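As a rough sketch, the postprocessing step could look like this. The cut points are the ones described above (6 and below is angry, 9 and 10 is happy, everything else is meh). The function name and the sample ratings are my own inventions for illustration.

```python
def to_three_point(score: int) -> int:
    """Collapse a 1-to-10 rating into 1 (angry), 2 (meh), or 3 (happy)."""
    if not 1 <= score <= 10:
        raise ValueError(f"expected a rating from 1 to 10, got {score}")
    if score <= 6:
        return 1   # how angry doesn't matter much
    if score >= 9:
        return 3   # 9 vs 10: let's not split hairs
    return 2       # 7 or 8: meh, and there are no degrees of meh

ratings = [10, 7, 3, 8, 9, 6, 1]  # made-up sample
print([to_three_point(r) for r in ratings])  # [3, 2, 1, 2, 3, 1, 1]
```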

Evangelism

So then what? Average out the measurements on the newly calibrated 1-2-3 scale, right?

Still no. It turns out there are three kinds of people: the ones so mad they will tell everyone how mad they are about your thing; the ones who don't care and will never think about you again if they can avoid it; and the ones who had such an over-the-top amazing experience that they will tell everyone how happy they are about your thing.

NPS says, you really care about the 1s and the 3s, but averaging them makes no sense. And the 2s have no effect on anything, so you can just leave them out.

Cool, right?

Pretty cool. Unfortunately, that's still two valuable numbers but we promised you one single score. So NPS says, let's subtract them! Yay! Okay, no. That's not the good part.
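For the record, here is a minimal sketch of that subtraction, assuming the usual definition of NPS as the percentage of happy raters minus the percentage of angry ones. The two sample populations are invented. They are very different and get the same score.

```python
def nps(ratings):
    """Percentage of happy raters (9-10) minus percentage of angry ones (6 or below)."""
    happy = sum(1 for r in ratings if r >= 9)
    angry = sum(1 for r in ratings if r <= 6)
    return 100 * (happy - angry) / len(ratings)

# Made-up populations of 100 ratings each.
quiet = [7] * 80 + [10] * 15 + [2] * 5   # mostly meh, a few outliers either way
divided = [10] * 55 + [2] * 45           # nobody is meh, almost half are furious

print(nps(quiet), nps(divided))  # 10.0 10.0
```

One number, two completely different situations.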

The threefold path

I like to look at it this way instead. First of all, we have computers now. We're not tracking ratings on one of those 1980s desktop bookkeeping printer-calculators, so you don't have to make every analysis into one single all-encompassing number.

Postprocessing a 10-point scale into a 3-point one, that seems pretty smart. But you have to stop there. Maybe you now have three separate aggregate numbers. That's tough, I'm sorry. Here's a nickel, kid, go sell your personal information in exchange for a spreadsheet app. (I don't know what you'll do with the nickel. Anyway I don't need it. Here. Go.)

Each of those three rating types gives you something different you can do in response:

The ones had a very bad experience, which is hopefully an outlier, unless you're Comcast or the New York Times subscription department. Normally you want to get rid of every bad experience. The absence of awful isn't greatness, it's just meh, but meh is infinitely better than awful. Eliminating negative outliers is a whole job. It's a job filled with Deming's special causes. It's hard, and it requires creativity, but it really matters.

The twos had a meh experience. This is, most commonly, the majority. But perhaps they could have had a better experience. Perhaps even a great one? Deming would say you can and should work to improve the average experience and reduce the standard deviation. That's the dream; heck, what if the average experience could be an amazing one? That's rarely achieved, but a few products achieve it, especially luxury brands. And maybe that Broadway show, Hamilton? I don't know, I couldn't get tickets, because everyone said it was great so it was always sold out and I guess that's my point.

If getting the average up to three is too hard or will take too long (and it will take a long time!), you could still try to at least randomly turn a few of them into threes. For example, they say users who have a great customer support experience often rate a product more highly than the ones who never needed to contact support at all, because the support interaction made the company feel more personal. Maybe you can't afford to interact with everyone, but if you have to interact anyway, perhaps you can use that chance to make it great instead of meh.

The threes already had an amazing experience. Nothing to do, right? No! These are the people who are, or who can become, your superfan evangelists. Sometimes that happens on its own, but often people don't know where to put that excess positive energy. You can help them. Pop stars and fashion brands know all about this; get some true believers really excited about your product, and the impact is huge. This is a completely different job than turning ones into twos, or twos into threes.
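Since each group is a different job, a report can just keep three numbers side by side instead of squashing them into one. Here's a minimal sketch, using the same cut points as earlier in this post. The function name and sample data are hypothetical.

```python
from collections import Counter

def three_shares(ratings):
    """Return the fraction of ones (angry), twos (meh), and threes (happy)."""
    if not ratings:
        raise ValueError("need at least one rating")

    def bucket(score):
        # 6 and below is a one, 9 and above is a three, the rest are twos.
        return 1 if score <= 6 else 3 if score >= 9 else 2

    counts = Counter(bucket(r) for r in ratings)
    return {label: counts[label] / len(ratings) for label in (1, 2, 3)}

sample = [7] * 14 + [10] * 4 + [2] * 2   # made-up ratings
print(three_shares(sample))  # {1: 0.1, 2: 0.7, 3: 0.2}
```

Three numbers, three teams, three different to-do lists.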

What not to do

Those are all good parts. Let's ignore that, unfortunately, they aren't part of NPS at all, and that we've strayed way off topic.

From here, there are several additional things you can do, but it turns out you shouldn't.

Don't compare scores with other products. I guarantee you, your methodology isn't the same as theirs. The slightest change in timing or presentation will change the score in incomparable ways. You just can't. I'm sorry.

Don't reward your team based on aggregate ratings. They will find a way to change the ratings. Trust me, it's too easy.

Don't average or difference the bad with the great. The two groups have nothing to do with each other, require completely different responses (usually from different teams), and are often very small. They're outliers after all. They're by definition not the mainstream. Outlier data is very noisy and each terrible experience is different from the others; each deliriously happy experience is special. As the famous writer said, all meh families are alike.

Don't fret about which "standard" rating ranges translate to bad-meh-good. Your particular survey or product will have the bad outliers, the big centre, and the great outliers. Run your survey enough and you'll be able to find them.

Don't call it NPS. NPS nowadays has a bad reputation. Nobody can really explain the bad reputation; I've asked. But they've all heard it's bad and wrong and misguided and unscientific and "not real statistics" and gives wrong answers and leads to bad incentives. You don't want that stigma attached to your survey mechanic. But if you call it a satisfaction survey on a 10-point or 5-point scale, tada, clear skies and lush green fields ahead.

Bonus advice

Perhaps the neatest thing about NPS is how much information you can get from just one simple question that can be answered with the same effort it takes to dismiss a popup.

I joked about Google Meet earlier, but I wasn't really kidding; after having a few meetings, if I had learned that I could just rate from 1 to 5 stars and then not get guilted for giving anything other than 5, I would do it. It would be great science and pretty unobtrusive. As it is, I lie instead. (I don't even skip, because it's faster to get back to the menu by lying than by skipping.)

While we're here, only the weirdest people want to answer a survey that says it will take "just 5 minutes" or "just 30 seconds." I don't have 30 seconds, I'm busy being mad/meh/excited about your product, I have other things to do! But I can click just one single star rating, as long as I'm 100% confident that the survey will go the heck away after that. (And don't even get me started about the extra layer in "Can we ask you a few simple questions about our website? Yes or no.")

Also, don't be the survey that promises one question and then asks "just one more question." Be the survey that gets a reputation for really truly asking that one question. Then ask it, optionally, in more places and more often. A good role model is those knowledgebases where every article offers just thumbs up or thumbs down (or the default of no click, which means meh). That way you can legitimately look at aggregates or even the same person's answers over time, at different points in the app, after they have different parts of the experience. And you can compare scores at the same point after you update the experience.

But for heaven's sake, not by just averaging them.
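What might that look like instead? One possible sketch: tally the answers separately per touchpoint and per version, and compare the tallies. The touchpoint name, the version labels, and the responses below are all invented for illustration, and "none" stands for the default of no click.

```python
from collections import Counter, defaultdict

# Hypothetical one-click responses: (touchpoint, app version, answer).
responses = [
    ("checkout", "v1", "down"), ("checkout", "v1", "none"),
    ("checkout", "v1", "none"), ("checkout", "v1", "up"),
    ("checkout", "v2", "none"), ("checkout", "v2", "none"),
    ("checkout", "v2", "up"),   ("checkout", "v2", "up"),
]

# One Counter per (touchpoint, version), so the groups never get blended.
tallies = defaultdict(Counter)
for touchpoint, version, answer in responses:
    tallies[(touchpoint, version)][answer] += 1

for key in sorted(tallies):
    c = tallies[key]
    print(key, "down:", c["down"], "none:", c["none"], "up:", c["up"])
```

In this made-up data, the update got rid of the one bad experience and added one great one. Those are two separate facts, and the tallies keep them separate.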

2023-12-04