Data Science Simplified Part 10: An Introduction to Classification Models
Webster defines classification as follows:
A systematic arrangement in groups or categories according to established criteria.
Classification Categories
Regression models estimate numerical variables, also known as dependent variables. For regression models, the target is always a number. Classification models have a qualitative target. These targets are also called categories.
In a large number of classification problems, the targets are designed to be binary. Binary implies that the target takes only one of two values, 0 or 1. Classifiers of this type are called binary classifiers. Let us take an example to understand this: sorting loan applicants into two buckets.

Bucket 1: Potential defaulters.

Bucket 2: Potential non-defaulters.
Linear and Non-Linear Classifiers

Anyone who falls on the left side of the line is a potential defaulter.

Anyone who falls on the right side of the line is a potential non-defaulter.
In essence, a classifier divides the feature space with a function (linear or nonlinear), such that one part of the feature space holds data from one class and the other part holds data from the other class.
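As a minimal sketch of this idea, a linear classifier over two features can be written as a weighted sum compared against zero. The feature names (income, debt), the weights, the bias, and the two applicants are hypothetical values chosen purely for illustration; they are not learned from any data in this post.

```python
def linear_classify(features, weights, bias):
    """Return 1 if the point lies on the positive side of the line, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


# Hypothetical boundary: debt - income = 0.
# Applicants whose debt exceeds their income land in class 1 (potential defaulter).
WEIGHTS = (-1.0, 1.0)  # order: (income, debt)
BIAS = 0.0

# Hypothetical applicants as (income, debt) pairs.
applicants = [(50.0, 20.0), (30.0, 45.0)]
labels = [linear_classify(a, WEIGHTS, BIAS) for a in applicants]
print(labels)
```

A nonlinear classifier follows the same pattern, except that the score is a nonlinear function of the features, so the boundary is a curve rather than a straight line.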
Evaluating Classifiers
We have an intuition of how classifiers work. How do we measure whether a classifier is doing a good job or not? Here comes the concept of the confusion matrix.
Let us take an example to understand this concept. We built a loan-defaulter classifier. This classifier takes input data, trains on it, and then classifies 100 applicants as follows:
 The classifier classifies 35 applicants as defaulters.
 The classifier classifies 65 applicants as non-defaulters.
Based on how the classifier's predictions compare with the actual outcomes, four more metrics are derived:
 Of those classified as defaulters, only 12 were actual defaulters. This metric is called True Positive (TP).
 Of those classified as defaulters, 23 were actual non-defaulters. This metric is called False Positive (FP).
 Of those classified as non-defaulters, 57 were actual non-defaulters. This metric is called True Negative (TN).
 Of those classified as non-defaulters, 8 were actual defaulters. This metric is called False Negative (FN).
These four metrics can be tabulated in a matrix called the confusion matrix.
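The tabulation can be sketched in a few lines of Python using the four counts from the loan example. The layout (predictions as rows, actual outcomes as columns) is an assumption; some tools use the transposed orientation.

```python
# Counts from the loan-defaulter example; "defaulter" is the positive class.
TP, FP, TN, FN = 12, 23, 57, 8

# Assumed layout: rows are predictions, columns are actual outcomes.
rows = [
    ("", "Actual defaulter", "Actual non-defaulter"),
    ("Predicted defaulter", f"TP = {TP}", f"FP = {FP}"),
    ("Predicted non-defaulter", f"FN = {FN}", f"TN = {TN}"),
]
for row in rows:
    print("".join(cell.ljust(26) for cell in row))

# Row totals should match what the classifier reported.
predicted_defaulters = TP + FP
predicted_non_defaulters = FN + TN
total = predicted_defaulters + predicted_non_defaulters
print(predicted_defaulters, predicted_non_defaulters, total)
```

The row totals are a useful sanity check: they should equal the 35 and 65 applicants the classifier placed in each bucket.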
From these four metrics, we will derive evaluation metrics for a classifier. Let us discuss these evaluation metrics.
Accuracy:
Accuracy measures how often the classifier is correct, counting both true positives and true negatives. Mathematically, it is defined as:
Accuracy = (True Positive + True Negative)/Total Predictions
In the example, the accuracy of the loan-default classifier is: (12+57)/100 = 0.69 = 69%.
Sensitivity or Recall:
Recall measures what share of the actual positives the classifier identified correctly. Mathematically, it is defined as:
Recall = True Positive/(True Positive + False Negative)
In the example, the recall of the loan-default classifier is: 12/(12+8) = 0.60 = 60%.
Specificity:
Specificity measures what share of the actual negatives the classifier identified correctly. Mathematically, it is defined as:
Specificity = True Negative/(True Negative + False Positive)
In the example, the specificity of the loan-default classifier is: 57/(57+23) = 0.7125 = 71.25%.
Precision:
Precision measures, of the total predicted to be positive, how many were actually positive. Mathematically, it is defined as:
Precision = True Positive/(True Positive + False Positive)
In the example, the precision of the loan-default classifier is: 12/(12+23) ≈ 0.343 = 34.3%.
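The four formulas above can be checked with a short Python sketch that plugs in the counts from the loan-default example. The helper name `ratio` and its convention of returning 0.0 for an empty denominator are assumptions made for this illustration.

```python
# Counts from the loan-default example; "defaulter" is the positive class.
TP, FP, TN, FN = 12, 23, 57, 8


def ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (an assumed convention)."""
    return numerator / denominator if denominator else 0.0


accuracy = ratio(TP + TN, TP + FP + TN + FN)
recall = ratio(TP, TP + FN)        # also called sensitivity
specificity = ratio(TN, TN + FP)
precision = ratio(TP, TP + FP)

for name, value in [
    ("Accuracy", accuracy),
    ("Recall", recall),
    ("Specificity", specificity),
    ("Precision", precision),
]:
    print(f"{name}: {value:.2%}")
```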
That is a lot of metrics. Which ones should we rely on? The answer depends very much on the business context. In any case, one metric alone will not give a full picture of how good the classifier is. Let us take an example.
We built a classifier that flags fraudulent transactions. This classifier determines whether a transaction is genuine or not. Historical patterns show that there are two fraudulent transactions for every hundred transactions. The confusion matrix of the classifier we built yields the following metrics:
 The Accuracy is 98%
 The Recall is 100%
 Precision is 98%
 Specificity is 0%
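One set of counts that reproduces these four numbers is sketched below. It rests on two assumptions that the text does not state: "genuine" is treated as the positive class, and the classifier labels every one of 100 transactions as genuine. Under those assumptions the classifier never catches a fraud, yet three of the four metrics look excellent.

```python
def evaluate(tp, fp, tn, fn):
    """Compute the four metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
    }


# Assumed counts: 98 genuine and 2 fraudulent transactions, all predicted genuine.
# Positive class = genuine, so the 2 frauds become false positives.
metrics = evaluate(tp=98, fp=2, tn=0, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```

Only specificity reveals the problem here, which is why a single metric is not enough to judge a classifier on imbalanced data.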
F1 Score:
The F1-score is the harmonic mean of precision and recall. The regular mean treats all values equally. The harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both recall and precision are high. It is defined as:
F1 = 2 × (Precision × Recall)/(Precision + Recall)
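To see how the harmonic mean penalizes a low value, the sketch below compares it with the regular mean. The precision and recall values (0.9 and 0.1, then 0.5 and 0.5) are hypothetical and chosen only to make the contrast visible.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical lopsided classifier: high precision, very low recall.
lopsided_f1 = f1_score(0.9, 0.1)
lopsided_mean = (0.9 + 0.1) / 2

# Hypothetical balanced classifier with the same regular mean.
balanced_f1 = f1_score(0.5, 0.5)

print(f"lopsided: mean={lopsided_mean:.2f}, F1={lopsided_f1:.2f}")
print(f"balanced: mean=0.50, F1={balanced_f1:.2f}")
```

Both classifiers have the same regular mean, but the lopsided one gets a much lower F1 score because its weak recall dominates the harmonic mean.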
Receiver Operating Characteristics (ROC) and Area Under Curve (AUC):
Receiver Operating Characteristics, a.k.a. ROC, is a visual metric. It is a two-dimensional plot with the False Positive Rate (1 - specificity) on the X-axis and the True Positive Rate (sensitivity) on the Y-axis.
In the ROC plot, a diagonal line shows the TPR and FPR that a random classifier would produce. It is straight because such a classifier has an equal probability of predicting 0 or 1.
If a classifier is doing a better job, it should achieve a higher TPR for a given FPR. This pushes the curve towards the northwest.
Area Under Curve (AUC) is the area under the ROC curve. If the AUC is 1, i.e. 100%, the classifier is perfect. If the AUC is 0.5, i.e. 50%, the classifier is no better than a coin toss.
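A rough sketch of how the curve and its area can be computed: sweep the decision threshold over the classifier's scores, record an (FPR, TPR) point at each step, and sum the area with the trapezoidal rule. The labels and scores below are hypothetical, and the sketch assumes both classes are present in the data.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time, highest first.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical data: 1 = positive class, scores = predicted probability of class 1.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
area = auc(curve)
print(curve)
print(f"AUC = {area:.3f}")
```

A classifier that ranks every positive above every negative would trace the top-left corner of the plot and reach an AUC of 1.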
Conclusion
In this post, we have seen the basics of classifiers. Classifiers are ubiquitous in data science. There are many algorithms that implement classifiers. Each has its own strengths and weaknesses. We will discuss a few algorithms in the next posts of this series.