Class 4.1: Diagnostics and classification

Methodology of Scientific Research

Andrés Aravena, PhD

7 April 2021

Example: diagnosis

As part of the strategy to control COVID-19, many governments carry out random sampling of the population, looking for asymptomatic cases.

Imagine that you are randomly chosen for a COVID-19 test. The test result is “positive”, that is, it says that you have the virus. You also know that the test sometimes fails, giving either a false positive or a false negative. The question is: what is the probability that you have COVID-19, given that the test said “positive”?

Context

Let’s assume that:

  • There are \(10^{5}\) people tested
  • The test gives the correct result in 99% of cases, whether or not the person has COVID
  • The prevalence of COVID in the population is 0.1%
  • The people to test are chosen randomly from the population

Since this context will be the same in all cases, we will not write it explicitly.

Let’s fill this matrix

         Test-   Test+   Total
COVID-   .       .       .
COVID+   .       .       .
Total    .       .       .

COVID reality in the rows and test results in the columns

We start with the total population

         Test-   Test+   Total
COVID-   .       .       .
COVID+   .       .       .
Total    .       .       100000

We will fill this matrix in the following slides

A large population size helps us to see small values

0.1% of them are COVID positive

         Test-   Test+   Total
COVID-   .       .       99900
COVID+   .       .       100
Total    .       .       100000

Prevalence is the percentage of the population that has COVID.
In other words, it is the probability that a randomly chosen person is COVID+ \[ \begin{aligned} ℙ(\text{COVID}_+) & =0.1\% = 0.001\\ ℙ(\text{COVID}_-) & =99.9\%=0.999 \end{aligned} \]

99% are correctly diagnosed

         Test-   Test+   Total
COVID-   .       .       99900
COVID+   .       99      100
Total    .       .       100000

The probability of a correct diagnosis for a person who has COVID (the sensitivity of the test) is \[ℙ(\text{test}_+ \vert \text{COVID}_+)=0.99\] We fill the box corresponding to (test+,COVID+) \[ℙ(\text{test}_+, \text{COVID}_+)=ℙ(\text{test}_+ \vert \text{COVID}_+)\cdot ℙ(\text{COVID}_+)\]
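As a quick numerical check of this step, with the prevalence of 0.1% and the \(10^{5}\) people tested: \[ℙ(\text{test}_+, \text{COVID}_+)=0.99\cdot 0.001=0.00099, \qquad 0.00099\cdot 10^{5}=99 \text{ people}\]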

99% are correctly diagnosed

         Test-   Test+   Total
COVID-   98901   .       99900
COVID+   .       99      100
Total    .       .       100000

In this example the probability of a correct diagnosis is the same for people without COVID (the specificity of the test) \[ℙ(\text{test}_- | \text{COVID}_-)=0.99\] We fill the box corresponding to (test-,COVID-) \[ℙ(\text{test}_-, \text{COVID}_-)=ℙ(\text{test}_- | \text{COVID}_-)⋅ℙ(\text{COVID}_-)\]

1% are misdiagnosed

         Test-   Test+   Total
COVID-   98901   999     99900
COVID+   1       99      100
Total    .       .       100000

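A misdiagnosis is the complement of a correct diagnosis \[ℙ(\text{test}_- | \text{COVID}_+)=1-ℙ(\text{test}_+ | \text{COVID}_+)=0.01\] and likewise \[ℙ(\text{test}_+ | \text{COVID}_-)=1-ℙ(\text{test}_- | \text{COVID}_-)=0.01\] We combine them with the prevalence in the same way as before \[ℙ(\text{test}_-, \text{COVID}_+)=ℙ(\text{test}_- | \text{COVID}_+)⋅ ℙ(\text{COVID}_+)\] \[ℙ(\text{test}_+, \text{COVID}_-)=ℙ(\text{test}_+ | \text{COVID}_-)⋅ ℙ(\text{COVID}_-)\]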

Total people diagnosed

         Test-   Test+   Total
COVID-   98901   999     99900
COVID+   1       99      100
Total    98902   1098    100000

We sum and fill the empty boxes
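The table built step by step in these slides can also be reproduced with a few lines of code. The following is only a sketch under the assumptions of the context; the function name `expected_table` and its parameters are hypothetical, chosen for this illustration.

```python
from fractions import Fraction

def expected_table(n_tested, prevalence, p_correct):
    """Expected counts for the 2x2 table, with row and column totals.

    Assumes, as in the slides, that p_correct applies both to people
    with COVID and to people without it.
    """
    covid_pos = n_tested * prevalence
    covid_neg = n_tested - covid_pos
    rows = {
        # reality: (Test-, Test+)
        "COVID-": (covid_neg * p_correct, covid_neg * (1 - p_correct)),
        "COVID+": (covid_pos * (1 - p_correct), covid_pos * p_correct),
    }
    # Column totals: add the two reality rows cell by cell
    rows["Total"] = tuple(a + b for a, b in zip(rows["COVID-"], rows["COVID+"]))
    # Append the row total as a third column
    return {name: (neg, pos, neg + pos) for name, (neg, pos) in rows.items()}

# Exact fractions avoid floating-point rounding in the counts
table = expected_table(10**5, Fraction(1, 1000), Fraction(99, 100))

print(f"{'':8}{'Test-':>8}{'Test+':>8}{'Total':>8}")
for name, cells in table.items():
    print(f"{name:8}" + "".join(f"{int(c):>8}" for c in cells))
```

Changing the prevalence or the probability of a correct result in the call shows how strongly the number of false positives depends on how rare the disease is.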

1098 people got a positive test, but only 99 of them have COVID \[ℙ(\text{COVID}_+ | \text{test}_+)=\frac{99}{1098} \approx 9.02\%\]
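The same value can be reached without counting people. Assuming Bayes' theorem and the law of total probability (standard results that these slides do not name explicitly), the calculation would be \[ℙ(\text{COVID}_+ | \text{test}_+)=\frac{ℙ(\text{test}_+ | \text{COVID}_+)⋅ℙ(\text{COVID}_+)}{ℙ(\text{test}_+ | \text{COVID}_+)⋅ℙ(\text{COVID}_+)+ℙ(\text{test}_+ | \text{COVID}_-)⋅ℙ(\text{COVID}_-)}\] \[=\frac{0.99\cdot 0.001}{0.99\cdot 0.001+0.01\cdot 0.999}=\frac{0.00099}{0.01098}\approx 9.02\%\] The numerator is the (test+,COVID+) box and the denominator is the Test+ column total, both divided by the total population.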

Diagnostics are classifiers

Confusion Matrix

                 Test: Yes        Test: No         Total
Reality: True    True Positive    False Negative   All True
Reality: False   False Positive   True Negative    All False
Total            Detected         Not detected     All cases

Other values that can be calculated

  • Sensitivity, specificity
  • Precision, Recall
  • F-index
  • Matthews correlation coefficient (MCC)

Values

“All the truth” \[\textrm{Sensitivity}=\frac{\textrm{True Positives}}{\textrm{All True}}\]
“Nothing but the truth” \[\textrm{Specificity}=\frac{\textrm{True Negatives}}{\textrm{All False}}\]
\[\textrm{Accuracy}=\frac{\textrm{True Positives}+\textrm{True Negatives}}{\textrm{All Cases}}\]

Values

\[\textrm{Precision}=\frac{\textrm{True Positives}}{\textrm{Detected}}\] \[\textrm{Recall}=\frac{\textrm{True Positives}}{\textrm{All True}}\] \[\frac{1}{\textrm{F-index}}=\frac{1}{2}\left(\frac{1}{\textrm{Precision}}+\frac{1}{\textrm{Recall}}\right)\]
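To see these definitions at work, the sketch below applies them to the counts of the COVID example (99 true positives, 1 false negative, 999 false positives, 98901 true negatives). The variable names are chosen for this illustration. The MCC line uses the usual textbook definition of the Matthews correlation coefficient, which is an assumption here because the slides list the MCC without giving its formula.

```python
from math import sqrt

# Counts taken from the COVID table of the example
tp, fn = 99, 1        # reality True:  detected / not detected
fp, tn = 999, 98901   # reality False: detected / not detected

all_true = tp + fn
all_false = fp + tn
detected = tp + fp
all_cases = all_true + all_false

sensitivity = tp / all_true
specificity = tn / all_false
accuracy = (tp + tn) / all_cases
precision = tp / detected
recall = tp / all_true          # same formula as sensitivity

# F-index: harmonic mean of precision and recall
f_index = 2 / (1 / precision + 1 / recall)

# MCC: usual textbook definition (assumed, not given in the slides)
mcc = (tp * tn - fp * fn) / sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [
    ("Sensitivity", sensitivity), ("Specificity", specificity),
    ("Accuracy", accuracy), ("Precision", precision),
    ("Recall", recall), ("F-index", f_index), ("MCC", mcc),
]:
    print(f"{name:12}{value:.4f}")
```

With these counts, sensitivity, specificity and accuracy are all 0.99, while precision is 99/1098, about 9%: the same value as the probability of having COVID given a positive test. The F-index and the MCC are much lower than the accuracy because they are sensitive to the large number of false positives.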