Modeling and inference – Classification and decision errors

Is the e-mail spam or not?

\(\mathbf{x}\): Features of an e-mail, such as word and character counts

\[ y = \begin{cases} 1 & \text{it's spam}\\ 0 & \text{it's legit} \end{cases} \]

Subject: Congratulations! You’ve Been Selected for an Exclusive Reward 🎁

Dear Customer,

You have been chosen as one of our preferred recipients to receive a special complimentary gift. This is our way of thanking you for your continued interest in our services.

To claim your reward, simply complete our short survey. Your participation takes only 60 seconds, and your prize will be shipped at no cost to you.

Click here to start your survey and claim your reward [Claim Reward Link]

This exclusive offer is available for the next 48 hours only. Don’t miss your chance to enjoy this limited opportunity.

Warm regards,
Promotions Team
Exclusive Rewards Center
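
As a rough illustration of what \(\mathbf{x}\) might contain for an e-mail like the one above, here is a minimal sketch that turns text into a few word and character counts. The function name `extract_features` and the specific features chosen are assumptions for illustration only; the slides do not say which counts are used.

```python
import re


def extract_features(text: str) -> dict:
    """Hypothetical feature vector x: a few simple word and character counts."""
    # Treat runs of letters (and apostrophes) as words.
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "n_chars": len(text),                  # total characters
        "n_words": len(words),                 # total words
        "n_exclamations": text.count("!"),     # exclamation marks
        "n_reward": sum(w.lower() == "reward" for w in words),  # uses of "reward"
    }


# Subject line of the example e-mail (emoji dropped, straight apostrophe used).
subject = "Congratulations! You've Been Selected for an Exclusive Reward"
x = extract_features(subject)
print(x)
```

A classifier would then map a vector like `x` to a predicted label \(y \in \{0, 1\}\).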

Prediction / classification

	Email is spam	Email is not spam
Email labelled spam	True positive	False positive (Type 1 error)
Email labelled not spam	False negative (Type 2 error)	True negative

False negative rate = P(Labelled not spam | Email spam) = FN / (TP + FN)
False positive rate = P(Labelled spam | Email not spam) = FP / (FP + TN)
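
To make the two rates concrete, the sketch below counts the four cells of the table from true and predicted labels, then applies the formulas above. The ten labels are made-up data for illustration, and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN, where 1 = spam and 0 = not spam."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # spam, labelled spam
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # not spam, labelled spam
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # spam, labelled not spam
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # not spam, labelled not spam
    return tp, fp, fn, tn


# Made-up example: 4 spam e-mails and 6 legitimate ones.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

false_negative_rate = fn / (tp + fn)  # P(labelled not spam | e-mail is spam)
false_positive_rate = fp / (fp + tn)  # P(labelled spam | e-mail is not spam)

print(f"FNR = {false_negative_rate:.3f}")  # FNR = 0.250
print(f"FPR = {false_positive_rate:.3f}")  # FPR = 0.167
```

Note that each rate is conditioned on the true class: the denominator of the false negative rate is all spam e-mails, and the denominator of the false positive rate is all legitimate e-mails.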

Sensitivity and specificity

	Email is spam	Email is not spam
Email labelled spam	True positive	False positive (Type 1 error)
Email labelled not spam	False negative (Type 2 error)	True negative

Sensitivity = P(Labelled spam | Email spam) = TP / (TP + FN)
- Sensitivity = 1 − False negative rate

Specificity = P(Labelled not spam | Email not spam) = TN / (FP + TN)
- Specificity = 1 − False positive rate
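
The complement relationships can be checked numerically. This is a small sketch using made-up counts (3 true positives, 1 false positive, 1 false negative, 5 true negatives); the function name `rates` is an assumption for illustration.

```python
import math


def rates(tp, fp, fn, tn):
    """Return sensitivity, specificity, FNR and FPR from confusion counts."""
    sensitivity = tp / (tp + fn)   # P(labelled spam | e-mail is spam)
    specificity = tn / (fp + tn)   # P(labelled not spam | e-mail is not spam)
    fnr = fn / (tp + fn)           # false negative rate
    fpr = fp / (fp + tn)           # false positive rate
    return sensitivity, specificity, fnr, fpr


# Made-up counts for illustration only.
sensitivity, specificity, fnr, fpr = rates(tp=3, fp=1, fn=1, tn=5)

print(f"Sensitivity = {sensitivity:.3f}")  # Sensitivity = 0.750
print(f"Specificity = {specificity:.3f}")  # Specificity = 0.833

# Sensitivity and FNR share a denominator (all spam e-mails), so they sum to 1;
# likewise specificity and FPR (all legitimate e-mails).
print(math.isclose(sensitivity, 1 - fnr))  # True
print(math.isclose(specificity, 1 - fpr))  # True
```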

If you were designing a spam filter, would you want sensitivity and specificity to be high or low? What are the trade-offs associated with each decision?