Data Science Simplified Part 10: An Introduction to Classification Models
Webster defines classification as follows:
A systematic arrangement in groups or categories according to established criteria.
Regression models estimate numerical variables, also known as dependent variables. For regression models, the target is always a number. Classification models have a qualitative target. These targets are also called categories.
In a large number of classification problems, the targets are designed to be binary. Binary implies that the target takes only the value 0 or 1. Classifiers of this type are called binary classifiers. Let us take an example to understand this.
Bucket 1: Potential defaulters.
Bucket 2: Potential non-defaulters.
Linear and Non-Linear Classifiers
Anyone who falls on the left side of the line is a potential defaulter.
Anyone who falls on the right side of the line is a potential non-defaulter.
Divide the feature space with a function (linear or non-linear). Divide it such that one part of the feature space has data from one class. The other part of the feature space has data from the other class.
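As a rough illustration of dividing the feature space with a linear function, the sketch below scores each point with a weighted sum and assigns a class according to the sign of that score. The two features, the weights and the bias are hypothetical values chosen only for illustration; they are not taken from the loan example in this post.

```python
def linear_classify(features, weights, bias):
    """Return 1 if the point falls on the positive side of the line, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


# Hypothetical decision boundary: x1 - x2 = 0
weights = [1.0, -1.0]
bias = 0.0

# Hypothetical points in a two-dimensional feature space
points = [(2.0, 1.0), (1.0, 3.0), (4.0, 0.5)]
labels = [linear_classify(p, weights, bias) for p in points]
print(labels)
```

A non-linear classifier follows the same idea, except that the score comes from a curved function of the features rather than a weighted sum.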
We have an intuition of how classifiers work. How do we measure whether a classifier is doing a good job or not? Here comes the concept of the confusion matrix.
Let us take an example to understand this concept. We built a loan-defaulter classifier. The classifier takes input data, trains on it, and produces the following results:
- The classifier classifies 35 applicants as defaulters.
- The classifier classifies 65 applicants as non-defaulters.
Based on how the classifier has performed, four more metrics are derived:
- From those classified as defaulters, only 12 were actual defaulters. This metric is called True Positive (TP).
- From those classified as defaulters, 23 were actual non-defaulters. This metric is called False Positive (FP).
- From those classified as non-defaulters, only 57 were actual non-defaulters. This metric is called True Negative (TN).
- From those classified as non-defaulters, 8 were actual defaulters. This metric is called False Negative (FN).
These four metrics can be tabulated in a matrix called the confusion matrix.
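The tabulation can be sketched in a few lines of Python. The list of (actual, predicted) pairs below is rebuilt from the counts quoted above, with 1 standing for defaulter and 0 for non-defaulter; the function name confusion_counts is a hypothetical helper, not part of any library.

```python
# Rebuild the 100 applicants from the counts in the example.
# 1 = defaulter (positive class), 0 = non-defaulter (negative class).
# Each pair is (actual, predicted).
pairs = (
    [(1, 1)] * 12    # true positives
    + [(0, 1)] * 23  # false positives
    + [(0, 0)] * 57  # true negatives
    + [(1, 0)] * 8   # false negatives
)


def confusion_counts(pairs):
    """Count TP, FP, TN and FN from (actual, predicted) pairs."""
    tp = sum(1 for actual, pred in pairs if actual == 1 and pred == 1)
    fp = sum(1 for actual, pred in pairs if actual == 0 and pred == 1)
    tn = sum(1 for actual, pred in pairs if actual == 0 and pred == 0)
    fn = sum(1 for actual, pred in pairs if actual == 1 and pred == 0)
    return tp, fp, tn, fn


tp, fp, tn, fn = confusion_counts(pairs)

# Rows are actual classes, columns are predicted classes.
print(f"{'':<22}{'Pred. defaulter':>18}{'Pred. non-defaulter':>22}")
print(f"{'Actual defaulter':<22}{tp:>18}{fn:>22}")
print(f"{'Actual non-defaulter':<22}{fp:>18}{tn:>22}")
```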
From these four metrics, we will derive evaluation metrics for a classifier. Let us discuss these evaluation metrics.
Accuracy measures how often the classifier is correct, counting both true positive and true negative cases. Mathematically, it is defined as:
Accuracy = (True Positive + True Negative)/Total Predictions.
In the example, the accuracy of the loan-default classifier is: (12+57) / 100 = 0.69 = 69%.
Sensitivity or Recall:
Recall measures how many of the actual positives the classifier identified correctly. Mathematically, it is defined as:
Recall = True Positive/(True Positive + False Negative)
In the example, the recall of the loan-default classifier is: 12/(12+8) = 0.60 = 60%.
Specificity measures how many of the actual negatives the classifier identified correctly. Mathematically, it is defined as:
Specificity = (True Negative)/(True Negative + False Positive)
In the example, the specificity of the loan-default classifier is: 57/(57+23) = 0.7125 = 71.25%.
Precision measures, of the total predicted to be positive, how many were actually positive. Mathematically, it is defined as:
Precision = (True Positive)/(True Positive + False Positive)
In the example, the precision of the loan-default classifier is: 12/(12+23) = 0.34 = 34% (rounded).
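The four formulas can be checked directly from the confusion-matrix counts of the loan-default example. This is a minimal sketch; the variable names are illustrative.

```python
# Counts from the loan-default example
tp, fp, tn, fn = 12, 23, 57, 8
total = tp + fp + tn + fn

accuracy = (tp + tn) / total
recall = tp / (tp + fn)        # also called sensitivity
specificity = tn / (tn + fp)
precision = tp / (tp + fp)

print(f"Accuracy:    {accuracy:.4f}")
print(f"Recall:      {recall:.4f}")
print(f"Specificity: {specificity:.4f}")
print(f"Precision:   {precision:.4f}")
```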
That is a lot of metrics. Which metrics should we rely on? The answer depends very much on the business context. In any case, one metric alone will not give a full picture of how good the classifier is. Let us take an example.
We built a classifier that flags fraudulent transactions. This classifier determines whether a transaction is genuine or not. Historical patterns show that there are two fraudulent transactions for every hundred transactions. The classifier we built produces the following metrics:
- The Accuracy is 98%
- The Recall is 100%
- Precision is 98%
- Specificity is 0%
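One set of counts that reproduces these four figures is sketched below. It rests on two assumptions that the text does not state explicitly: "genuine" is treated as the positive class, and the classifier labels every one of 100 transactions as genuine.

```python
# Assumed scenario: 100 transactions, 98 genuine and 2 fraudulent.
# Assumption: "genuine" is the positive class and the classifier
# predicts "genuine" for every transaction.
tp = 98  # genuine, predicted genuine
fp = 2   # fraudulent, predicted genuine
tn = 0   # fraudulent, predicted fraudulent
fn = 0   # genuine, predicted fraudulent

accuracy = (tp + tn) / (tp + fp + tn + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
specificity = tn / (tn + fp)

print(f"Accuracy:    {accuracy:.0%}")
print(f"Recall:      {recall:.0%}")
print(f"Precision:   {precision:.0%}")
print(f"Specificity: {specificity:.0%}")
```

Under these assumed counts the classifier never catches a single fraudulent transaction, yet accuracy, recall and precision all look excellent. Only specificity exposes the problem.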
F1-score is the harmonic mean of precision and recall. The regular mean treats all values equally. The harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both recall and precision are high. It is defined as:
F1 = 2 x (Precision x Recall)/(Precision + Recall)
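To see how the harmonic mean penalises a low value, the sketch below computes the F1 score for the loan-default example (TP = 12, FP = 23, FN = 8) and compares it with the regular mean. The second pair of values, a precision of 1.0 with a recall of 0.1, is hypothetical and is included only to make the contrast visible.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Loan-default example
precision = 12 / (12 + 23)
recall = 12 / (12 + 8)
f1 = f1_score(precision, recall)
regular_mean = (precision + recall) / 2
print(f"Loan example: F1 = {f1:.4f}, regular mean = {regular_mean:.4f}")

# Hypothetical lopsided classifier: perfect precision, poor recall
lopsided_f1 = f1_score(1.0, 0.1)
lopsided_mean = (1.0 + 0.1) / 2
print(f"Lopsided:     F1 = {lopsided_f1:.4f}, regular mean = {lopsided_mean:.4f}")
```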
Receiver Operating Characteristics (ROC) and Area Under Curve (AUC):
Receiver Operating Characteristics, also known as ROC, is a visual metric. It is a two-dimensional plot with False Positive Rate (1 - specificity) on the X-axis and True Positive Rate (sensitivity) on the Y-axis.
In the ROC plot, there is a diagonal line that shows the TPR and FPR of a random classifier. It is straight because a random classifier has an equal probability of predicting 0 or 1.
If a classifier is doing a better job, it should have a higher TPR relative to its FPR. This pushes the curve towards the north-west.
Area Under Curve (AUC) is the area under the ROC curve. If the AUC is 1, i.e. 100%, the classifier is perfect. If the AUC is 0.5, i.e. 50%, the classifier is no better than a coin toss.
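A minimal sketch of how the ROC points and the AUC can be computed is shown below. It assumes the classifier outputs a score for each example and that an example is predicted positive when its score is at or above a threshold. The labels and scores are hypothetical, and the area is approximated with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, sweeping the threshold from high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical true labels (1 = positive) and classifier scores
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
area = auc(points)
print(points)
print(f"AUC = {area:.4f}")
```

In this hypothetical data, one negative example scores higher than one positive example, so the AUC falls between the coin-toss value of 0.5 and the perfect value of 1.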
In this post, we have seen the basics of a classifier. Classifiers are ubiquitous in data science. There are many algorithms that implement classifiers. Each has its own strengths and weaknesses. We will discuss a few algorithms in the next posts of this series.