Class 4.1: Diagnostics and classification

Methodology of Scientific Research

Andrés Aravena, PhD

7 April 2021

Example: diagnosis

As part of the strategy to control COVID-19, many governments carry out random sampling of the population to look for asymptomatic cases.

Imagine that you are randomly chosen for a COVID-19 test. The test result is “positive”, that is, it says that you have the virus. You also know that the test sometimes fails, giving either a false positive or a false negative. The question is then: what is the probability that you have COVID-19, given that the test said “positive”?

Context

Let’s assume that:

There are \(10^{5}\) people tested
The test has a precision of 99%: it gives the correct result for 99% of the people tested, whether they have COVID or not
The prevalence of COVID in the population is 0.1%
The people to test are chosen randomly from the population

Since this context will be the same in all cases, we will not write it explicitly in the probabilities that follow.

Let’s fill this matrix

	Test-	Test+	Total
COVID-	.	.	.
COVID+	.	.	.
Total	.	.	.

COVID reality in the rows and test results in the columns

We start with the total population

	Test-	Test+	Total
COVID-	.	.	.
COVID+	.	.	.
Total	.	.	1e+05

We will fill this matrix in the following slides

A large population size helps us to see small values

0.1% of them are COVID positive

	Test-	Test+	Total
COVID-	.	.	99900
COVID+	.	.	100
Total	.	.	1e+05

Prevalence is the percentage of the population that has COVID.
In other words, it is the probability of (COVID₊) \[ \begin{aligned} ℙ(\text{COVID}_+) & =0.1\% = 0.001\\ ℙ(\text{COVID}_-) & =99.9\%=0.999 \end{aligned} \]
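The row totals follow directly from the prevalence. A minimal sketch in Python, assuming the values stated in the context of the example; the variable names are only illustrative.

```python
# Assumed values, taken from the context of the example
n_tested = 100_000      # people tested
prevalence = 0.001      # P(COVID+), that is, 0.1%

# Expected number of people in each row of the matrix
n_covid_pos = round(n_tested * prevalence)
n_covid_neg = n_tested - n_covid_pos

print(n_covid_neg, n_covid_pos)  # 99900 100
```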

99% are correctly diagnosed

	Test-	Test+	Total
COVID-	.	.	99900
COVID+	.	99	100
Total	.	.	1e+05

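Precision is here the probability of a correct diagnosis (for positive cases, this is the sensitivity of the test) \[ℙ(\text{test}_+ \vert \text{COVID}_+)=0.99\] We fill the box corresponding to (test₊,COVID₊) \[ℙ(\text{test}_+, \text{COVID}_+)=ℙ(\text{test}_+ \vert \text{COVID}_+)\cdot ℙ(\text{COVID}_+)=0.99\cdot 0.001\] that is, 99 of the \(10^{5}\) people tested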

99% are correctly diagnosed

	Test-	Test+	Total
COVID-	98901	.	99900
COVID+	.	99	100
Total	.	.	1e+05

In this case the precision for negative cases is the same \[ℙ(\text{test}_- | \text{COVID}_-)=0.99\] We fill the box corresponding to (test₋,COVID₋) \[ℙ(\text{test}_-, \text{COVID}_-)=ℙ(\text{test}_- | \text{COVID}_-)⋅ℙ(\text{COVID}_-)=0.99⋅0.999\] that is, 98901 of the \(10^{5}\) people tested

1% are misdiagnosed

	Test-	Test+	Total
COVID-	98901	999	99900
COVID+	1	99	100
Total	.	.	1e+05

A misdiagnosis is the complement of a correct diagnosis \[ℙ(\text{test}_- | \text{COVID}_+)=1-ℙ(\text{test}_+ | \text{COVID}_+)=0.01\] and, in the same way, \(ℙ(\text{test}_+ | \text{COVID}_-)=0.01\). We combine them in the same way as before \[ℙ(\text{test}_-, \text{COVID}_+)=ℙ(\text{test}_- | \text{COVID}_+)⋅ ℙ(\text{COVID}_+)\]
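The four inner boxes of the matrix can be reproduced with a few lines of Python. This is only a sketch under the assumptions of the example (99% correct diagnosis for both positive and negative cases); the variable names are illustrative and not part of the original slides.

```python
# Row totals from the prevalence (0.1% of 100000 people)
n_covid_pos, n_covid_neg = 100, 99_900
p_correct = 0.99  # assumed equal for COVID+ and COVID- cases

# Joint counts: P(test, COVID) = P(test | COVID) * P(COVID), scaled by N
true_pos = round(p_correct * n_covid_pos)   # test+, COVID+
true_neg = round(p_correct * n_covid_neg)   # test-, COVID-
false_neg = n_covid_pos - true_pos          # test-, COVID+
false_pos = n_covid_neg - true_neg          # test+, COVID-

print("COVID-", true_neg, false_pos)  # COVID- 98901 999
print("COVID+", false_neg, true_pos)  # COVID+ 1 99
```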

Total people diagnosed

	Test-	Test+	Total
COVID-	98901	999	99900
COVID+	1	99	100
Total	98902	1098	1e+05

We sum and fill the empty boxes

1098 people got a positive test, but only 99 of them have COVID \[ℙ(\text{COVID}_+ | \text{test}_+)=\frac{99}{1098} = 9.02\%\]
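The same number can be obtained without the table, by dividing the joint probability of (test₊,COVID₊) by the total probability of a positive test, which is Bayes' theorem. The following is a sketch under the assumptions of the example; the function name `prob_covid_given_positive` and its parameters are hypothetical.

```python
def prob_covid_given_positive(prevalence, p_pos_given_covid, p_neg_given_healthy):
    """P(COVID+ | test+) from the prevalence and the two rates of correct diagnosis."""
    # Joint probabilities of the two ways of getting a positive test
    p_pos_and_covid = p_pos_given_covid * prevalence
    p_pos_and_healthy = (1 - p_neg_given_healthy) * (1 - prevalence)
    # Divide by the total probability of a positive test
    return p_pos_and_covid / (p_pos_and_covid + p_pos_and_healthy)


posterior = prob_covid_given_positive(0.001, 0.99, 0.99)
print(f"{posterior:.2%}")  # 9.02%
```

With the same test and an assumed prevalence of 10%, the function returns about 92%, which suggests how strongly the answer depends on the prevalence.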

Diagnostics are classifiers

Confusion Matrix

Reality \ Test	Yes	No	Total
True	True Positive	False Negative	All True
False	False Positive	True Negative	All False
Total	Detected	Not detected	All cases
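When the data comes as one pair (reality, test result) per case, the four cells can be counted directly. A minimal sketch, assuming both values are stored as booleans; the function name `confusion_counts` and the toy data are made up for illustration.

```python
from collections import Counter


def confusion_counts(reality, detected):
    """Count the four cells of a confusion matrix from paired booleans."""
    counts = Counter(zip(reality, detected))
    return {
        "TP": counts[(True, True)],    # true case, detected
        "FN": counts[(True, False)],   # true case, not detected
        "FP": counts[(False, True)],   # false case, detected
        "TN": counts[(False, False)],  # false case, not detected
    }


# Made-up toy data, only for illustration
reality = [True, True, True, False, False, False, False, False]
detected = [True, True, False, True, False, False, False, False]

cells = confusion_counts(reality, detected)
print(cells)  # {'TP': 2, 'FN': 1, 'FP': 1, 'TN': 4}
```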

Other values that can be calculated

Sensitivity, specificity
Precision, Recall
F-index
Matthews correlation coefficient (MCC)

Values

“All the truth” \[\textrm{Sensitivity}=\frac{\textrm{True Positives}}{\textrm{All True}}\] “Nothing but the truth” \[\textrm{Specificity}=\frac{\textrm{True Negatives}}{\textrm{All False}}\] \[\textrm{Accuracy}=\frac{\textrm{True Positives} + \textrm{True Negatives}}{\textrm{All Cases}}\]
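As an illustration, these three values can be computed for the COVID test matrix of the first example. The counts below are copied from that matrix; the variable names are only illustrative.

```python
# Counts from the COVID example: rows are reality, columns are test results
tp, fn = 99, 1          # COVID+ : test+, test-
fp, tn = 999, 98_901    # COVID- : test+, test-

sensitivity = tp / (tp + fn)                 # True Positives / All True
specificity = tn / (tn + fp)                 # True Negatives / All False
accuracy = (tp + tn) / (tp + fn + fp + tn)   # correct results / All Cases

print(sensitivity, specificity, accuracy)  # 0.99 0.99 0.99
```

All three are 99% here because the example assumes the same rate of correct diagnosis for positive and negative cases.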

Values

\[\textrm{Precision}=\frac{\textrm{True Positives}}{\textrm{Detected}}\] \[\textrm{Recall}=\frac{\textrm{True Positives}}{\textrm{All True}}\] \[\frac{1}{\textrm{F-index}}=\frac{1}{2}\left(\frac{1}{\textrm{Precision}}+\frac{1}{\textrm{Recall}}\right)\]
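The same COVID matrix gives a very different picture with these values. The sketch below uses the counts of the first example; the formula for the Matthews correlation coefficient is the usual textbook definition, which is an addition here and is not given in the slides.

```python
from math import sqrt

# Counts from the COVID example
tp, fn = 99, 1          # COVID+ : test+, test-
fp, tn = 999, 98_901    # COVID- : test+, test-

precision = tp / (tp + fp)   # True Positives / Detected
recall = tp / (tp + fn)      # True Positives / All True

# The F-index is the harmonic mean of precision and recall
f_index = 2 / (1 / precision + 1 / recall)

# Matthews correlation coefficient (standard definition, assumed here)
mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"precision={precision:.3f} recall={recall:.3f}")  # precision=0.090 recall=0.990
print(f"F={f_index:.3f} MCC={mcc:.3f}")                  # F=0.165 MCC=0.297
```

The precision is the 9.02% found in the first example: most of the people detected by the test do not have COVID, even though the recall is 99%.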