Cross-validation is a powerful technique that’s widely used in machine learning to assess the performance of models and prevent overfitting.
However, as many data scientists have discovered, cross-validation can also go wrong in a number of ways, leading to inaccurate or unreliable results.
In this blog post, Data Scientist Tam Tran-The explores some of the common pitfalls of cross-validation and shows you how to fix them.
Cross-validation is a resampling procedure used to estimate the performance of machine learning models on a limited data set. This procedure is commonly used when optimising the hyper-parameters of a model and/or when evaluating performance of the final model. However, there are multiple nuances in the procedure design that might make the obtained results less robust or even wrong.
Consider that you are working on a classification problem with tabular data containing hundreds of features. You decide to select features based on their corresponding ANOVA f-statistics with the outcome label.
You first perform the feature selection strategy on the entire dataset to select the top k features (where k is an arbitrary number) with the highest f-statistics. After this, you decide to do cross-validation and feed the data with selected features into the CV loop to estimate the model performance.
Performing feature selection on the full dataset before CV leads to data leakage.
Image by Author.
Here you have committed the mistake of data leakage. Since you perform a selection strategy that involves learning about the outcome label on the entire dataset, knowledge about the validation set, especially the outcome label, is made available to the model during training. This gives the model an unrealistic advantage to make better predictions, which wouldn’t happen in real-time production.
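As a sketch of this anti-pattern (the dataset, k=20, and the classifier are illustrative choices, not from the original post), the leak happens because the selector is fit on all rows, including the ones that later become validation folds:

```python
# WARNING: this reproduces the leaky procedure described above --
# do NOT use it in practice.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=5, random_state=0)

# Leak: the selector sees the labels of ALL rows, including those that
# will later serve as validation folds inside cross_val_score.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)

leaky_scores = cross_val_score(LogisticRegression(max_iter=1000),
                               X_selected, y, cv=5)
print(f"Leaky CV accuracy: {leaky_scores.mean():.3f}")  # optimistically biased
```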
Instead of choosing an arbitrary number of features, you want to choose features whose p-value of f-statistics is smaller than a certain threshold. You think of p-value threshold as a model hyper-parameter – something you need to tune to get the best-performing set of features, thereby the best-performing model.
As CV is well known for hyper-parameter optimisation, you then evaluate a set of distinct p-value thresholds by performing the procedure on the whole dataset. The problem is that you use this CV estimate both to choose the best p-value threshold (hence the best set of features) and to report the final performance estimate.
Combining hyper-parameter tuning with model evaluation in the same CV loop leads to an optimistically biased evaluation of the model performance. Image by Author.
When hyper-parameter tuning and model evaluation are combined in the same CV loop, the test data used for evaluation is no longer statistically pure: it has been “seen” by the models during tuning. The hyper-parameter settings retain a partial “memory” of the data that now form the test partition. Each time a model with different hyper-parameters is evaluated on a sample set, it reveals information about that data, and this knowledge can be exploited in the model configuration procedure to find the configuration that performs best on the dataset. Hyper-parameters can thus be tuned in ways that exploit meaningless statistical peculiarities of the sample.

In other words, over-fitting in hyper-parameter tuning is possible whenever the CV estimate of generalisation performance, evaluated over a finite sample of data, is directly optimised. The CV procedure reduces this effect, but it cannot be removed completely, especially when the sample of data is small and the number of hyper-parameters to be tuned is relatively large. You should therefore expect to observe an optimistic bias in performance estimates obtained in this manner.
CV methods are proven to be unbiased only if all aspects of classifier training take place inside the CV loop. This means that every step of training a classifier, e.g. feature selection, classifier type selection and classifier hyper-parameter tuning, is performed only on the data not left out during each CV iteration. Violating this principle in any way can result in severely biased estimates of the true error.
In Scenario 1, feature selection should have been done inside each CV loop to avoid data leakage.
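One way to keep feature selection inside the CV loop is to wrap it in a scikit-learn Pipeline, so the selector is refit on the training folds of each split (again, the dataset and k=20 here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=5, random_state=0)

# The selector is refit on the training folds only within each CV split,
# so the validation fold never influences which features are kept.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest_scores = cross_val_score(pipe, X, y, cv=5)
print(f"Leakage-free CV accuracy: {honest_scores.mean():.3f}")
```

Because `cross_val_score` clones and refits the whole pipeline per split, the selection step never sees the validation fold's labels.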
To avoid undesirable optimistic bias, model evaluation must be treated as an integral part of the model fitting process and performed afresh every time a model is fitted to a new sample of data.
In Scenario 2, model performance should be evaluated on a totally unseen test set that has not been touched during the hyper-parameter optimisation. In cases where your dataset is so small that you cannot afford a separate hold-out set, nested CV should be used. Specifically, the inner loop is used for the hyper-parameter search, and the outer loop is used to estimate the generalisation error by averaging test set scores over several dataset splits.
Model performance should be evaluated on a totally unseen test set that has not been touched during the hyper-parameter optimisation. Image by Author.
Nested CV should be used in case you can’t afford a separate hold-out test set. The inner loop is used for hyper-parameter search and the outer loop is used to estimate the generalisation error.
Image by Author.
Scikit-learn does have out-of-the-box methods to support nested CV. Specifically, you can use GridSearchCV (or RandomizedSearchCV) for the hyper-parameter search in the inner loop and cross_val_score to estimate the generalisation error in the outer loop.
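A minimal sketch of that off-the-shelf approach, using the p-value-threshold scenario from earlier (the dataset and the candidate thresholds are illustrative; `SelectFpr` keeps features whose f-test p-value is below `alpha`):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFpr, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=100,
                           n_informative=5, random_state=0)

pipe = Pipeline([
    ("select", SelectFpr(f_classif)),  # keeps features with p-value < alpha
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"select__alpha": [0.01, 0.05, 0.10]}  # p-value thresholds to tune

# Inner loop: hyper-parameter search. Outer loop: generalisation estimate.
inner = GridSearchCV(pipe, param_grid, cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"Nested CV accuracy: {outer_scores.mean():.3f}")
```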
For the purpose of illustrating what happens under the hood in nested CV, the code snippet below doesn’t use these off-the-shelf methods. This implementation will also be helpful in case the scoring strategy you’re looking to implement is not supported by GridSearchCV. However, this approach only works when you have a small search space to optimise over. For a larger hyper-parameter search space, Scikit-learn CV tools are a neater and more efficient way to go.
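A hand-rolled sketch of the nested loops might look like the following (my reconstruction under the same assumptions as before: an illustrative dataset and candidate p-value thresholds for `SelectFpr`):

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFpr, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=100,
                           n_informative=5, random_state=0)

thresholds = [0.01, 0.05, 0.10]  # candidate p-value thresholds
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_scores = []


def make_pipe(alpha):
    return Pipeline([
        ("select", SelectFpr(f_classif, alpha=alpha)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


for train_idx, test_idx in outer_cv.split(X):
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_te, y_te = X[test_idx], y[test_idx]

    # Inner loop: pick the threshold with the best inner-CV score,
    # using the outer training folds only.
    inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
    best_alpha, best_inner = None, -np.inf
    for alpha in thresholds:
        fold_scores = []
        for i_tr, i_va in inner_cv.split(X_tr):
            model = clone(make_pipe(alpha)).fit(X_tr[i_tr], y_tr[i_tr])
            fold_scores.append(model.score(X_tr[i_va], y_tr[i_va]))
        if np.mean(fold_scores) > best_inner:
            best_inner, best_alpha = np.mean(fold_scores), alpha

    # Refit with the winning threshold on the full outer training fold,
    # then score once on the untouched outer test fold.
    best_pipe = make_pipe(best_alpha).fit(X_tr, y_tr)
    outer_scores.append(best_pipe.score(X_te, y_te))

print(f"Nested CV accuracy: {np.mean(outer_scores):.3f}")
```

Note that the outer test fold is touched exactly once per split, and only by the already-configured winner of the inner search, which is what keeps the outer estimate honest.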
Noteworthy question 1:
Which feature set to use in the production model if we apply feature selection strategy in the CV loop?
Due to the stochastic nature of train/test split, when we apply the feature selection strategy inside a CV loop, it’s likely that the best set of features found for each outer loop is slightly different (even though the model performance might almost be the same over runs). The question then is: what set of features should you use in the production model?
To answer this question, remember:
Cross-validation tests a procedure, not a single model instance.
Essentially, we use CV to estimate how well the entire model building procedure will perform on future unseen data, including the data preparation strategy (e.g. imputation), feature selection strategy (e.g. the p-value threshold to use for the one-way ANOVA test), choice of algorithm (e.g. logistic regression vs XGBoost) and the specific algorithm configuration (e.g. the number of trees in XGBoost). Once we have used CV to choose the winning procedure, we then apply that same best-performing procedure to the whole dataset to come up with our final production model. The fitted models from CV have served their purpose of performance estimation and can now be discarded.
In that sense, whatever feature set is output by applying the winning procedure to the whole dataset is what will be used in the final production model.
Noteworthy question 2:
If we train a model on all of the available data for the production model, how do we know how well that model will perform?
Following up on question 1: if we apply the best-performing procedure found through CV to the whole dataset, how do we know how well that production model instance will perform?
If well designed, CV gives you performance measures that describe how well the finalised model, trained on all historical data, will perform in general. You already answered that question by using the CV procedure for model evaluation! That’s why it’s critical to make sure your CV procedure is designed appropriately.