Classification and decision errors



Modeling and inference

Data Science with R

Is the e-mail spam or not?

\(\mathbf{x}\): Word and character counts, etc. in an e-mail

\[ y = \begin{cases} 1 & \text{it's spam}\\ 0 & \text{it's legit} \end{cases} \]

Prediction / classification

| | Email is spam | Email is not spam |
|---|---|---|
| Email labelled spam | True positive | False positive (Type 1 error) |
| Email labelled not spam | False negative (Type 2 error) | True negative |

  • False negative rate = P(Labelled not spam | Email spam) = FN / (TP + FN)

  • False positive rate = P(Labelled spam | Email not spam) = FP / (FP + TN)
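
As a minimal sketch of how these two rates could be computed in base R, the block below counts the four cells and applies the formulas above. The vectors `actual` and `predicted` are hypothetical toy labels made up for illustration (1 = spam, 0 = not spam); they are not data from the course.

```r
# Hypothetical labels for 10 e-mails (illustration only): 1 = spam, 0 = not spam
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

# Cross-tabulate labels against the truth (rows: labelled, columns: actual)
conf_mat <- table(Labelled = predicted, Actual = actual)
print(conf_mat)

# Count each cell of the table directly
tp <- sum(predicted == 1 & actual == 1)  # spam labelled spam
fn <- sum(predicted == 0 & actual == 1)  # spam labelled not spam
fp <- sum(predicted == 1 & actual == 0)  # legit labelled spam
tn <- sum(predicted == 0 & actual == 0)  # legit labelled not spam

fnr <- fn / (tp + fn)  # P(Labelled not spam | Email spam)
fpr <- fp / (fp + tn)  # P(Labelled spam | Email not spam)

cat("False negative rate:", fnr, "\n")
cat("False positive rate:", fpr, "\n")
```

With these toy labels, one of the four spam e-mails is missed and one of the six legitimate e-mails is flagged, so the false negative rate is 1/4 and the false positive rate is 1/6.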

Sensitivity and specificity

  • Sensitivity = P(Labelled spam | Email spam) = TP / (TP + FN)

    • Sensitivity = 1 − False negative rate
  • Specificity = P(Labelled not spam | Email not spam) = TN / (FP + TN)

    • Specificity = 1 − False positive rate
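
The complement relationships above can be checked numerically. The sketch below uses a hypothetical helper, `class_rates()`, and made-up cell counts chosen only for illustration; neither comes from the course materials.

```r
# Hypothetical helper: compute all four rates from the cells of the table
class_rates <- function(tp, fn, fp, tn) {
  c(
    sensitivity = tp / (tp + fn),  # P(Labelled spam | Email spam)
    specificity = tn / (fp + tn),  # P(Labelled not spam | Email not spam)
    fnr         = fn / (tp + fn),  # false negative rate
    fpr         = fp / (fp + tn)   # false positive rate
  )
}

# Made-up counts for illustration: 50 spam e-mails and 50 legitimate ones
rates <- class_rates(tp = 40, fn = 10, fp = 5, tn = 45)
print(rates)

# Sensitivity and specificity are the complements of the two error rates
print(all.equal(rates[["sensitivity"]], 1 - rates[["fnr"]]))
print(all.equal(rates[["specificity"]], 1 - rates[["fpr"]]))
```

With these counts, sensitivity is 0.8 and specificity is 0.9, matching 1 − 0.2 and 1 − 0.1.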

If you were designing a spam filter, would you want sensitivity and specificity to be high or low? What are the trade-offs associated with each decision?
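
One way to explore this question is to assume the filter produces a spam score for each e-mail and labels it spam when the score meets a cutoff. The sketch below is only an illustration under that assumption: the `score` values, the cutoffs, and the helper `rates_at()` are all hypothetical.

```r
# Hypothetical truth and spam scores for 10 e-mails (illustration only)
actual <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
score  <- c(0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.10, 0.05, 0.02)

# Label an e-mail spam when its score is at or above the threshold,
# then compute sensitivity and specificity for that labelling
rates_at <- function(threshold) {
  predicted <- as.integer(score >= threshold)
  c(
    threshold   = threshold,
    sensitivity = sum(predicted == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(predicted == 0 & actual == 0) / sum(actual == 0)
  )
}

# One row per cutoff
result <- t(sapply(c(0.25, 0.50, 0.75), rates_at))
print(round(result, 3))
```

In this toy example, raising the cutoff lowers sensitivity (more spam gets through) while raising specificity (fewer legitimate e-mails are flagged), so the two cannot both be maximised by moving the cutoff alone.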