I don’t know why, but it took me a little while to properly make sense of these diagnostics, so I wanted to develop a very simple illustration of the logic behind these concepts. ROC stands for Receiver Operating Characteristic, while AUC is the area under the ROC curve, which is used as a metric for model performance in a classification problem. Performance is measured as the ability to maximise true positives while minimising false positives.

Let’s say we have some data on whether or not a bug (invertebrate) is a true bug (Hemiptera). Here, our response variable `y` is 1 for a true bug and 0 for any other bug. Now say we have a model that outputs the probability of a true bug, `f`, from some feature set (like how many legs or antennae segments each bug has). Here `f` is the model prediction (I haven’t shown how this prediction was developed as there are endless models that will output a prediction like this from some feature set). Note that the data set is sorted by `f` from lowest to highest.

```
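library(tidyverse)
# y: observed class (1 = true bug, 0 = any other bug)
# f: model prediction (probability of a true bug)
d <- tibble(y = c(0, 0, 0, 0, 1, 1, 0, 1, 1, 1),
            f = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9) / 10) # important they are sorted
knitr::kable(d)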
```

| y | f |
|---|---|
| 0 | 0.0 |
| 0 | 0.1 |
| 0 | 0.2 |
| 0 | 0.3 |
| 1 | 0.4 |
| 1 | 0.5 |
| 0 | 0.6 |
| 1 | 0.7 |
| 1 | 0.8 |
| 1 | 0.9 |

What is the rate of true and false positives? Well, it depends on what threshold we use for our model output `f`. For example, we might interpret any value of `f` of 0.5 or above as a true bug prediction. Or we could be more conservative and interpret any value of 0.2 or above as a true bug (e.g. we might not want to miss a true bug but can tolerate some false positives). Each of these scenarios will lead to different rates of true and false positives. Note that the code uses `>=`, so the columns labelled `f>0.5` and `f>0.2` include the threshold value itself.

```
d %>%
  # `>=` means the threshold value itself counts as a positive prediction
  mutate(`f>0.5` = ifelse(f >= 0.5, 'predict=1', 'predict=0'),
         `f>0.2` = ifelse(f >= 0.2, 'predict=1', 'predict=0')) %>%
  knitr::kable()
```

| y | f | f>0.5 | f>0.2 |
|---|---|---|---|
| 0 | 0.0 | predict=0 | predict=0 |
| 0 | 0.1 | predict=0 | predict=0 |
| 0 | 0.2 | predict=0 | predict=1 |
| 0 | 0.3 | predict=0 | predict=1 |
| 1 | 0.4 | predict=0 | predict=1 |
| 1 | 0.5 | predict=1 | predict=1 |
| 0 | 0.6 | predict=1 | predict=1 |
| 1 | 0.7 | predict=1 | predict=1 |
| 1 | 0.8 | predict=1 | predict=1 |
| 1 | 0.9 | predict=1 | predict=1 |
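To make the counting explicit, here is a minimal base R sketch that works out the true and false positive rates for these two thresholds. The helper name `rates_at` is made up for this illustration, and it uses `>=` to match the code above.

```r
# Same toy data as above
y <- c(0, 0, 0, 0, 1, 1, 0, 1, 1, 1)
f <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9) / 10

# Hypothetical helper: true and false positive rates at one threshold
rates_at <- function(threshold) {
  pred <- f >= threshold                      # TRUE = predicted true bug
  c(tpr = sum(pred & y == 1) / sum(y == 1),   # share of true bugs caught
    fpr = sum(pred & y == 0) / sum(y == 0))   # share of other bugs flagged
}

r05 <- rates_at(0.5)
r02 <- rates_at(0.2)
print(rbind(`f>0.5` = r05, `f>0.2` = r02))
```

Lowering the threshold from 0.5 to 0.2 catches every true bug, but at the cost of flagging more of the other bugs.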

The idea of the ROC curve is to calculate the rate of true positives and false positives for all threshold values of `f` and plot the results.

```
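# Predictions for the actual positives (y == 1) and actual negatives (y == 0)
f_truepos <- d$f[d$y == 1]
f_falsepos <- d$f[d$y == 0]
# Pre-allocate the rate columns (avoids tibble warnings about unknown columns)
d$truepos <- NA_real_
d$falsepos <- NA_real_
# Use each observed value of f as the threshold in turn
for (i in 1:nrow(d)) {
  d$truepos[i] <- mean(f_truepos >= d$f[i])   # true positive rate
  d$falsepos[i] <- mean(f_falsepos >= d$f[i]) # false positive rate
}
ggplot(d, aes(falsepos, truepos, label = paste0('f>', f))) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, col = 'red') + # chance line
  geom_point(col = 'white') +
  geom_path(col = 'red') +
  scale_x_continuous(limits = c(0, 1.05), breaks = 1:5 * 0.2) +
  scale_y_continuous(limits = c(0, 1.0), breaks = 1:5 * 0.2) +
  geom_text(size = 3, nudge_x = 0.05, nudge_y = -0.05, col = 'white') +
  theme_minimal() +
  # mydarktheme is a custom dark theme that is not defined in this post;
  # without it, drop this line and use a dark colour for the points and labels
  mydarktheme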
```

We can see our previous manual calculations using the prediction thresholds f>0.2 and f>0.5 as well as all other unique thresholds. When f>0.2 we have a true positive rate of 5/5 = 1 and a false positive rate of 3/5 = 0.6. When f>0.5 we have a true positive rate of 4/5 = 0.8 and a false positive rate of 1/5 = 0.2.

Hopefully, it should be clear that the more values we have in the upper left hand corner (high true positives and low false positives) the better our model is doing. It follows that the area under this ROC curve (AUC) will reflect the ability of the model to correctly categorise the data, with a maximum value of 1 and a minimum of 0. A value of around 0.5 is the case where the model output is doing no better than chance (the dotted line), while values under 0.5 suggest that the model is performing worse than chance. Here, from visual inspection we can see that the AUC is 1 - 0.4 x 0.2 = 0.92, that is, the full unit square minus the rectangle (0.2 wide and 0.4 high) missing from the upper left hand corner.
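As a check on the visual estimate, here is a rough base R sketch that computes the AUC in two ways: with the trapezoid rule over the ROC points, and as the proportion of (true bug, other bug) pairs in which the true bug gets the higher prediction. The variable names are my own, and the extra `Inf` threshold is an assumption added so the curve starts at the origin.

```r
# Same toy data as above
y <- c(0, 0, 0, 0, 1, 1, 0, 1, 1, 1)
f <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9) / 10

# Thresholds from strictest to most lenient; Inf gives the (0, 0) corner
thresholds <- c(Inf, sort(unique(f), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) mean(f[y == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(f[y == 0] >= t))

# Trapezoid rule: segment width times average height, summed over segments
auc_trap <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# Pairwise view: how often does a true bug outscore another bug?
# Ties (none in this data) count as half
pos <- f[y == 1]
neg <- f[y == 0]
auc_rank <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

print(c(trapezoid = auc_trap, pairwise = auc_rank))
```

Both approaches agree with the 0.92 read off the plot. The pairwise version also gives a handy interpretation: the AUC is the chance that a randomly picked true bug is ranked above a randomly picked other bug.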

What I previously found confusing about ROC curves is that the curve is generated for varying threshold values of `f`. I found that labelling the threshold values on the plot for each respective true and false positive rate helped me to visualise this, and it is also handy for deciding what threshold value you might like to use in a practical application of the model.

Finally, it is worth mentioning that the rate of true positives is also called sensitivity or recall, and the rate of false positives is also called 1 - specificity, where specificity is the rate of true negatives.
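To tie these names back to the toy data, here is a small sketch that counts the four confusion matrix cells at one threshold and derives each quantity from them. The threshold of 0.5 and the variable names are just choices for illustration.

```r
# Same toy data as above
y <- c(0, 0, 0, 0, 1, 1, 0, 1, 1, 1)
f <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9) / 10

pred <- as.integer(f >= 0.5)      # 1 = predicted true bug

tp <- sum(pred == 1 & y == 1)     # true positives
fn <- sum(pred == 0 & y == 1)     # false negatives
fp <- sum(pred == 1 & y == 0)     # false positives
tn <- sum(pred == 0 & y == 0)     # true negatives

sensitivity <- tp / (tp + fn)     # true positive rate, also called recall
specificity <- tn / (tn + fp)     # true negative rate
false_pos_rate <- fp / (fp + tn)  # should equal 1 - specificity

print(c(sensitivity = sensitivity,
        specificity = specificity,
        false_pos_rate = false_pos_rate))
```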