
How good is my test? 

Louise Pryor, incoming President-elect at the Institute and Faculty of Actuaries, discusses how we interpret the accuracy of COVID-19 tests. 

What comes into your mind when you see a statement that a 100% accurate test for Covid-19 is on its way? Two things come into mine. First, 100% accuracy is probably impossible in practice. And second, what does “accurate” mean for a test, anyway?

On the first point, it’s always useful to remember the difference between theory and practice. In theory, there is no difference between practice and theory. In practice, there is. (And no, it apparently wasn’t Yogi Berra who came up with this.) The point is that even if a test is absolutely spot on in theory, in practice the right protocol isn’t going to be observed every single time, and things might go wrong. Maybe a throat swab isn’t taken correctly, or it gets contaminated, for example. In real life, stuff happens, and not always according to the book.

On the second point, the notion of accuracy for a test isn’t as simple as it seems. You might think that the best way of measuring it is to calculate the proportion of test instances that give the right answer, so if the test gets it right every time it’s performed, it would be 100% accurate. That does make sense for 100% accuracy. And it’s pretty meaningful if the accuracy is low, too: a test that’s 20% accurate doesn’t seem that useful. The problem comes when numbers such as 95% or 99% accurate are used. Is 95% accuracy a sign of a good test? You might think so.

But imagine you are testing for a condition which is quite rare – suppose only 2% of the population has it. If your test simply gave the answer “no” every time, it would actually be 98% accurate on this measure. But it’s not a good test to use. The trouble is that this simple accuracy measure is influenced by the relative proportions of people with and without the condition in the population as a whole. We need something else.
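The arithmetic behind this pitfall is easy to check. Here is a minimal sketch, using a made-up population of 100 people at the 2% prevalence mentioned above, of how the always-“no” test scores 98% on the simple accuracy measure:

```python
# Hypothetical population: 2 people in 100 have the condition (2% prevalence).
population = ["sick"] * 2 + ["healthy"] * 98

def always_no_test(person):
    """A useless test that answers "negative" for everyone."""
    return "negative"

# Count how often the test's answer matches the truth.
correct = sum(
    1 for person in population
    if (always_no_test(person) == "positive") == (person == "sick")
)
accuracy = correct / len(population)
print(f"Simple accuracy: {accuracy:.0%}")  # prints "Simple accuracy: 98%"
```

The test is right for all 98 healthy people and wrong only for the 2 sick ones, so the headline accuracy looks impressive while the test itself is worthless.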

There are in fact several measures that are useful. They are based on thinking of the problem slightly differently. Let’s suppose that everyone in the population either has a certain condition or doesn’t. Then there are four possible results: the test correctly identifies someone as having the condition, falsely identifies them as having it (they are in fact healthy), correctly identifies them as being healthy, or falsely identifies them as healthy (they actually have the condition).
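The four possible results above can be written out directly. This is just a sketch of the standard confusion-matrix labels, with hypothetical function and argument names:

```python
def outcome(has_condition, test_positive):
    """Classify one test result into the four possible outcomes."""
    if has_condition and test_positive:
        return "true positive"       # correctly identified as having it
    if not has_condition and test_positive:
        return "false positive"      # healthy, but flagged as having it
    if not has_condition and not test_positive:
        return "true negative"       # correctly identified as healthy
    return "false negative"          # has the condition, but called healthy

print(outcome(True, False))  # prints "false negative"
```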

[1] Source: AMS Blogs

The false positive rate is the proportion of healthy people who are falsely identified as having the condition, and the false negative rate is the proportion of people with the condition whom the test identifies as healthy. With our bad test, the false positive rate is zero (because the test never identifies anyone as having the condition) and the false negative rate is 100%. Another way of looking at it is in terms of sensitivity and specificity: how many of the people with the condition does the test identify, and how good is the test at flagging only the people who actually have the condition?

Our bad test has a sensitivity of 0% (it never correctly identifies anyone as having the condition) and a specificity of 100% (everyone without the condition is identified as not having the condition).
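These measures all fall out of the same four counts. Below is a sketch using made-up numbers for the always-“no” test applied to 10,000 people at 2% prevalence; the variable names are illustrative, not from any particular library:

```python
# Hypothetical counts: 10,000 tested, 200 have the condition (2% prevalence),
# and the test answers "no" for everyone.
true_positives = 0       # never flags anyone with the condition
false_negatives = 200    # everyone with the condition is missed
false_positives = 0      # never flags anyone healthy
true_negatives = 9_800   # all healthy people correctly left alone

sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)
false_negative_rate = false_negatives / (true_positives + false_negatives)
false_positive_rate = false_positives / (false_positives + true_negatives)

print(sensitivity, specificity)  # prints "0.0 1.0"
```

Note that sensitivity and the false negative rate sum to 1, as do specificity and the false positive rate: they are two views of the same split.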

In practice, comparatively few clinical tests have both perfect sensitivity and perfect specificity. Tests often work by detecting the presence of one or more substances, such as particular proteins. And that means that you need to set some kind of threshold for deciding whether the substance(s) are present or not. Set the threshold too high, and you miss some genuine occurrences – you have more false negatives, and your sensitivity is too low. Set it too low, and you get more false positives – your specificity is too low. There are always trade-offs.
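The threshold trade-off can be sketched with a toy example. The scores below are entirely made up; the point is only that moving one threshold pushes the two error rates in opposite directions:

```python
# Made-up detection scores: higher means more of the substance was detected.
sick_scores = [0.9, 0.8, 0.7, 0.4]     # people who have the condition
healthy_scores = [0.6, 0.3, 0.2, 0.1]  # people who don't

def rates(threshold):
    """False negative and false positive rates at a given threshold."""
    false_negatives = sum(s < threshold for s in sick_scores)
    false_positives = sum(s >= threshold for s in healthy_scores)
    fnr = false_negatives / len(sick_scores)
    fpr = false_positives / len(healthy_scores)
    return fnr, fpr

# A high threshold misses real cases; a low one flags healthy people.
print(rates(0.85))  # prints "(0.75, 0.0)"
print(rates(0.15))  # prints "(0.0, 0.75)"
```

Because the score distributions overlap (one sick person scores 0.4, one healthy person 0.6), no threshold gets both rates to zero at once, which is exactly the trade-off described above.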

Both false negatives and false positives can have significant implications. Too many false positives, and, in the case of Covid-19, many people would be self-isolating who don’t need to, with possible bad impacts on their financial well-being or mental health. Too many false negatives, and many people who should be self-isolating won’t be, and the disease will spread more rapidly. 

[2] Source: By FeanDoe - Modified version from Walber's Precision and Recall, CC BY-SA 4.0

There are the same types of problem with similar trade-offs in many applications of data science, where machine learning is used to classify objects or events. For instance, too many false positives on fingerprint recognition would mean that anybody could unlock your smartphone; or you might get more calls from your bank asking you to confirm that credit card transactions are genuine. Too many false negatives, on the other hand, and you yourself won’t be able to unlock your smartphone (as your fingerprint is less likely to be recognised) and fraudulent credit card transactions are more likely to go through.

So the next time you read about some clinical test or machine learning algorithm being 95% accurate (or indeed any other level of accuracy), stop and think about what that could mean in practice. You’ll probably find that it doesn’t actually tell you very much about how useful the test is.
