Machine learning algorithms learn patterns from data, and the method is intuitive. The model determines the underlying pattern from a given data set. This process is called training the model. The trained model is then tested on another data set that it has not seen before. The goal is always to find the optimal model: the sweet spot where the model performs satisfactorily on both the training set and the test set.
The test error is the average error that results when the model makes predictions on new observations, that is, observations it has not seen before. The training error rate is often quite different from the test error rate and can dramatically underestimate it.
As model complexity goes up, the training error goes down. The training error goes down because the added complexity lets the model learn all the variations found in the training data, including its noise. This is called overfitting. However, since the test sample is unseen data, an overfitted model performs poorly on test samples.
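The effect can be sketched in a few lines. The example below is only an illustration under assumptions that are not in the original text: synthetic data from a hypothetical true function sin(2πx) with Gaussian noise, and polynomial degree standing in for model complexity.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Draw n noisy samples from the assumed true function sin(2*pi*x)."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)
    return x, y

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

x_train, y_train = make_data(30)    # data the model learns from
x_test, y_test = make_data(200)     # unseen data

degrees = [1, 3, 6, 9]              # increasing model complexity
train_errors, test_errors = [], []
for degree in degrees:
    model = Polynomial.fit(x_train, y_train, degree)  # least-squares fit
    train_errors.append(mse(model, x_train, y_train))
    test_errors.append(mse(model, x_test, y_test))

for degree, tr, te in zip(degrees, train_errors, test_errors):
    print(f"degree={degree}  train MSE={tr:.4f}  test MSE={te:.4f}")
```

Because each higher-degree polynomial contains the lower-degree ones as special cases, the training error can only stay the same or fall as the degree grows. The test error carries no such guarantee.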
The ingredients that contribute to this behavior are bias and variance.
Bias is defined as how far the prediction is, on average, from the actual value.
Variance is defined as how much the estimate varies around its average.
As the model becomes more complex, the following happens:
- When the model is too simple and underfits the data, bias is high and variance is low.
- As the model complexity increases, the bias goes down. A complex model can adapt to various behaviors in the data points.
- However, the variance also increases, because the model has to estimate more and more parameters.
The bias-variance trade-off is the balance between these two effects, and the sweet spot that the model aspires to achieve is the level of complexity at which their combined error is lowest.
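The trade-off is commonly written as a decomposition of the expected test error at a point. The notation below is an assumption, not taken from the original text: the data follow $y = f(x) + \varepsilon$, $\hat{f}$ is the fitted model, and $(x_0, y_0)$ is an unseen observation.

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \mathrm{Var}\left(\hat{f}(x_0)\right)
  + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \mathrm{Var}(\varepsilon)
```

The last term is noise that no model can remove, so lowering the test error means keeping both the variance and the squared bias small at the same time.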
A trained model is only as good as the data it is trained on. So how do we ensure that biases in the data don't seep into model training? How do we ensure that the model generalizes well enough?
Resampling methods are used to ensure that the model is good enough and can handle variations in data. They do this by training the model on the variety of patterns found in the dataset. This article discusses those resampling methods.
Validation Set Approach
The validation set approach is a simple method of sampling for training and testing. The data is split into two parts. The first part is used to train the model. The second part is used to test the model.
The validation set approach is simple. However, it comes with its own set of drawbacks.
- Firstly, what the model learns is highly dependent on the observations included in the training set. If an outlier is included in the training set, the model will tend to learn from it, even though it may not be representative of the actual data.
- Secondly, only a subset of the observations is included in the training set. Excluding observations from training deprives the model of the nuances of the data held out in the test set.
In general, the validation set error tends to overestimate the test error, because the model is trained on fewer observations than are actually available.
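A minimal sketch of the validation set approach, assuming synthetic linear data, a 70/30 split, and a straight-line model fit with NumPy. None of these choices come from the original text, and the helper names are hypothetical.

```python
import numpy as np

def validation_split(n, test_fraction, rng):
    """Shuffle the indices 0..n-1 and split them into train and test parts."""
    indices = rng.permutation(n)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]

def validation_mse(x, y, test_fraction, seed):
    """Fit a line on the training part and return the MSE on the test part."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = validation_split(len(x), test_fraction, rng)
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], 1)
    predictions = slope * x[test_idx] + intercept
    return float(np.mean((predictions - y[test_idx]) ** 2))

# Synthetic data (assumption): y = 2x + 1 plus Gaussian noise.
data_rng = np.random.default_rng(42)
x = data_rng.uniform(0.0, 10.0, 100)
y = 2.0 * x + 1.0 + data_rng.normal(0.0, 2.0, 100)

# The same data and model, evaluated on five different random splits.
estimates = [validation_mse(x, y, test_fraction=0.3, seed=s) for s in range(5)]
for s, e in enumerate(estimates):
    print(f"split seed={s}  validation MSE={e:.3f}")
```

Running the same model on several random splits gives a different error estimate each time, which is the dependence on the chosen observations described in the first drawback.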
K-Fold Cross-Validation
We have seen the challenges of the validation set approach. The k-fold cross-validation method is used to overcome these challenges. This approach works as follows:
- The data is split into k groups called folds. Typically, k is 5 or 10 and the folds are of equal size. Each fold contains a random subset of the data points.
- In the first iteration, the model is trained on (k-1) folds and tested on the one left-out fold.
- This process is repeated until every fold has been used once as the test fold.
Let us take an example.
- In this example, we have a dataset. This dataset is split into ten equal folds.
- For the first iteration, nine folds are used to train the model, i.e., folds 2-10.
- The model is tested on fold 1.
- Training and testing errors are noted for iteration 1.
- In the second iteration, nine folds are again used to train the model. This time, however, fold 1 is used for training along with eight other folds. The training is done on fold 1 and folds 3-10.
- The model is tested on fold 2.
- Training and testing errors are noted for iteration 2.
- This process continues until every fold has been used once for testing, with the model trained on the remaining nine folds each time.
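The overall performance of the model is computed as the mean error across all the iterations.
- For a regression model, the mean error across all the folds can be defined as follows:
$$\mathrm{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{MSE}_i$$
where $\mathrm{MSE}_i$ is the mean squared error on fold $i$.
- For a classifier, the mean error across all the folds can be defined as follows:
$$\mathrm{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{Err}_i$$
where $\mathrm{Err}_i$ is a classifier metric computed on fold $i$, such as AUC, recall, or precision.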
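The procedure above can be sketched as follows. This is a minimal illustration, assuming synthetic linear data, a straight-line regression model, and k = 10. The function names are hypothetical, and in practice a library routine would normally be used instead.

```python
import numpy as np

def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and split them into k near-equal folds."""
    return np.array_split(rng.permutation(n), k)

def k_fold_cv_mse(x, y, k, seed=0):
    """Return the mean test MSE over k folds and the list of per-fold MSEs."""
    rng = np.random.default_rng(seed)
    folds = k_fold_indices(len(x), k, rng)
    fold_mse = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except fold i, then test on fold i.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], 1)
        predictions = slope * x[test_idx] + intercept
        fold_mse.append(float(np.mean((predictions - y[test_idx]) ** 2)))
    return float(np.mean(fold_mse)), fold_mse

# Synthetic data (assumption): y = 2x + 1 plus Gaussian noise.
data_rng = np.random.default_rng(42)
x = data_rng.uniform(0.0, 10.0, 100)
y = 2.0 * x + 1.0 + data_rng.normal(0.0, 2.0, 100)

cv_error, fold_errors = k_fold_cv_mse(x, y, k=10)
print(f"10-fold CV estimate of test MSE: {cv_error:.3f}")
```

Every observation is used for testing exactly once and for training k-1 times, so no single unlucky split decides the reported error.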
As we can see, the k-fold cross-validation method eliminates many of the drawbacks of the validation set method. In particular, it does an excellent job of ensuring that bias from one particular split doesn't seep into the measured model performance. It does so elegantly by using every fold for both training and testing.
However, as expected, this method can be time-consuming compared to the simpler validation set approach. The reason is that cross-validation fits the model k times, whereas the validation set approach fits it only once. This cost is more pronounced when the training set is large.
Bootstrap Sampling
Another method of sampling data is bootstrap sampling. The bootstrap is a flexible and powerful statistical method that can be used to quantify the uncertainty associated with an estimator. The bootstrapping process takes the following approach:
- Rather than repeatedly obtaining independent data sets from the population, we obtain distinct data sets by repeatedly sampling observations from the original data set with replacement.
- Each of these bootstrap data sets is the same size as our original data set.
- An observation may appear more than once in a bootstrap sample or not at all.
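As a minimal sketch of how these steps quantify uncertainty, the example below estimates the standard error of a sample median. The synthetic data, the choice of the median as the estimator, and the helper name `bootstrap_standard_error` are assumptions made for illustration.

```python
import numpy as np

def bootstrap_standard_error(data, estimator, n_boot=1000, seed=0):
    """Estimate the standard error of `estimator` by bootstrap resampling."""
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Draw n indices with replacement: some repeat, some never appear.
        sample = data[rng.integers(0, n, size=n)]
        estimates[b] = estimator(sample)
    # The spread of the bootstrap estimates approximates the standard error.
    return float(np.std(estimates, ddof=1)), estimates

# Synthetic data (assumption): 200 draws from a normal distribution.
data_rng = np.random.default_rng(1)
data = data_rng.normal(50.0, 10.0, 200)

se_median, median_estimates = bootstrap_standard_error(data, np.median)
print(f"sample median={np.median(data):.2f}  bootstrap SE={se_median:.2f}")
```

The same function works for any estimator that maps a sample to a number, which is what makes the bootstrap flexible.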
Let us look at an example to understand it better.
In the diagram above, there are ten observations. Bootstrap sampling works in the following manner:
- The original dataset has ten observations.
- The training set is the same size as the original dataset, i.e., training is done on ten observations. However, because sampling is done with replacement, some observations from the original dataset appear more than once in the training set. In the example above, for the first iteration, observations 2, 3, 4, and 9 are repeated from the original dataset. Observation #1 is not repeated.
- Once the model is trained, it is tested on the unseen data. Unseen data are those observations that are present in the original dataset but not in the training data set. In other words, the test data set is the original dataset minus the training dataset.
These three steps are repeated for bootstrap sample #2 as well. This process continues for a prescribed number of bootstrap samples (typically on the order of 1,000). The overall bootstrap estimate is the average of the estimates obtained from the individual bootstrap samples.
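The loop described above can be sketched as follows. This is only an illustration, assuming synthetic linear data and a straight-line model. The function name is hypothetical. For a dataset of size n, the chance that a given observation is left out of one bootstrap sample is (1 - 1/n)^n, roughly 37% for moderately large n, so each iteration has a usable test set.

```python
import numpy as np

def bootstrap_oob_mse(x, y, n_boot=1000, seed=0):
    """Average test MSE over bootstrap samples, tested on left-out observations."""
    rng = np.random.default_rng(seed)
    n = len(x)
    errors, left_out_fractions = [], []
    for _ in range(n_boot):
        train_idx = rng.integers(0, n, size=n)   # sample with replacement
        test_mask = np.ones(n, dtype=bool)
        test_mask[train_idx] = False             # observations never drawn
        if not test_mask.any() or np.unique(train_idx).size < 2:
            continue                             # nothing to test on or fit
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], 1)
        predictions = slope * x[test_mask] + intercept
        errors.append(float(np.mean((predictions - y[test_mask]) ** 2)))
        left_out_fractions.append(float(test_mask.mean()))
    return float(np.mean(errors)), float(np.mean(left_out_fractions))

# Synthetic data (assumption): y = 2x + 1 plus Gaussian noise.
data_rng = np.random.default_rng(42)
x = data_rng.uniform(0.0, 10.0, 100)
y = 2.0 * x + 1.0 + data_rng.normal(0.0, 2.0, 100)

boot_error, left_out = bootstrap_oob_mse(x, y)
print(f"bootstrap test MSE={boot_error:.3f}  mean left-out fraction={left_out:.3f}")
```

The final estimate is the plain average over all bootstrap samples, matching the description in the text.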
A bootstrap estimate tends to have lower variance than an estimate obtained from a single train-test split.
Bootstrap sampling is also advantageous in practice. If there are relatively few observations of interest, bootstrap sampling can be used to repeatedly sample the same observations in the dataset for training.
This article illustrates three methods of resampling. The general idea is to enable the model to learn as much as possible. To do so, it should be trained on the variety of data points found in the underlying dataset. In practice, the simple validation set approach is used to create a model quickly. The model is then further refined using the k-fold cross-validation method.