Cross-Validation Classification Pitfalls in Neuroimaging

Common mistakes in cross-validation

Hopefully, after this week I’ll get to a normal schedule of posting weekly. I have finished my last application and I’m ready to buy Red Dead Redemption 2. En suite.

Cross-validation (CV) is a useful machine learning tool that gives you an estimate of your model's accuracy and can tune your model's parameters. In this post I will only talk about its use in estimating model accuracy, and some problems with that. In my tutorial, I use CV to tune the parameters of the model, not to estimate its accuracy. Now, it’s not wrong to use CV to estimate accuracy, but I have some moderately strong opinions about people who do.

I will first discuss the different types of CV and some of their pros and cons. I will then discuss my opinions about CV, why I think many publications have inflated classification results, and what should be done about them. I am only talking about classification using diffusion, resting-state, and structural MRI features. I have little experience with task-fMRI classification and I’m unsure whether these concerns affect task-fMRI.

What is CV? It means iteratively drawing a sample from a set of observations to train a model on, then testing that model on a hold-out set to estimate its test error. This can be used to select the model parameters (as done in my tutorial), estimate test error, or both.

LOOCV: Iteratively trains on the whole set of observations minus one and tests the model on the held-out observation. The estimate of the test MSE is the average of the n test error estimates. LOOCV does a good job of not overestimating the test MSE because each model is trained on n - 1 observations, which is almost the full dataset.

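K-fold CV: The same idea as LOOCV, but it randomly divides the observations into k groups (folds) of approximately equal size. Each fold is held out once as the test set while the model trains on the other k - 1 folds, and the k error estimates are averaged. LOOCV is the special case where k = n.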
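Here is a minimal sketch of both schemes in plain Python, just to make the mechanics concrete. The function names (kfold_indices, cv_mse) and the toy "predict the training mean" model are made up for illustration; in practice you would use your ML library's own CV utilities.

```python
import random

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold CV over n observations."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # random assignment to folds
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cv_mse(x, y, k, fit, predict, seed=0):
    """Average the per-fold test MSE over the k folds."""
    fold_mses = []
    for train_idx, test_idx in kfold_indices(len(y), k, seed):
        # The model only ever sees the training fold.
        model = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        errs = [(predict(model, x[i]) - y[i]) ** 2 for i in test_idx]
        fold_mses.append(sum(errs) / len(errs))
    return sum(fold_mses) / len(fold_mses)

# Toy model (hypothetical): predict the training mean of y, ignoring x.
def fit_mean(x_train, y_train):
    return sum(y_train) / len(y_train)

def predict_mean(model, x_new):
    return model

x = list(range(10))
y = [float(v) for v in range(10)]
kfold_estimate = cv_mse(x, y, k=5, fit=fit_mean, predict=predict_mean)
loocv_estimate = cv_mse(x, y, k=len(y), fit=fit_mean, predict=predict_mean)  # LOOCV: k = n
```

Note that setting k equal to the number of observations turns the same function into LOOCV, so the two methods differ only in how many observations are held out per iteration.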

Man, CV sounds great! It looks like it’s on its way to solving all of our low-sample-size problems!

There are some papers that claim astonishing accuracies of 100% or in the upper 90s with very small sample sizes. There is no way their models are that generalizable. There’s a reason why Facebook, Google, Amazon, etc. mine the data of hundreds of millions of people and still get ad guesses wrong. So why are these studies getting such high accuracies?

Reported accuracy tends to increase as the sample size shrinks. This is counter-intuitive because CV accuracy estimates should, in theory, decrease with smaller samples. I’ve had some hunches and others have actually run some tests. While I’m not entirely sure of the reasons, there are some contenders. Let’s think about how neuroimaging studies are run. In general, they are small and counterbalanced a priori to remove confounds. This can be interpreted in two ways: the sample isn’t representative of the general population you are studying, and the design could be removing hard-to-predict observations from your study.

Very few papers actually test their models on a separate dataset. This review makes it extremely apparent how rare independent testing is, despite the amount of open-access data available. Its authors also make the same observation I have: smaller-n studies report higher accuracies.

Another issue is that studies aren’t properly accounting for confounds (including my tutorial). In frequentist statistics, controlling for things like age, sex, etc. is important for making appropriate inferences. A common way to do this is to regress the confounds out of the other features (with OLS) and then do stats on the residuals. I haven’t seen classification studies doing this too often. As this study indicates, regressing confounds out of the whole feature set at once probably isn’t doing the job, and fold-wise regression is the way to go. This is the same line of reasoning as to why we do preprocessing within each fold of cross-validation: nothing estimated from the test fold should leak into training. I might put some code out in R in the future. The authors already have Python code. I have also created a tutorial video in R.
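To show what fold-wise regression means in practice, here is a rough sketch in plain Python. It is not the study authors' code; it assumes a single confound (age) and a single feature, and the helper names (fit_confound, residualize, foldwise_residualize) and the toy data are made up for illustration. The key point is that the confound model is estimated on the training fold only and then applied unchanged to the test fold.

```python
def fit_confound(confound, feature):
    """OLS fit of feature ~ intercept + slope * confound; returns (intercept, slope)."""
    n = len(confound)
    c_mean = sum(confound) / n
    f_mean = sum(feature) / n
    sxx = sum((c - c_mean) ** 2 for c in confound)
    sxy = sum((c - c_mean) * (f - f_mean) for c, f in zip(confound, feature))
    slope = sxy / sxx if sxx > 0 else 0.0  # constant confound: nothing to remove
    return f_mean - slope * c_mean, slope

def residualize(confound, feature, coef):
    """Subtract the confound's fitted contribution from each feature value."""
    intercept, slope = coef
    return [f - (intercept + slope * c) for c, f in zip(confound, feature)]

def foldwise_residualize(confound, feature, train_idx, test_idx):
    """Estimate the confound model on the training fold, apply it to both folds."""
    c_train = [confound[i] for i in train_idx]
    f_train = [feature[i] for i in train_idx]
    c_test = [confound[i] for i in test_idx]
    f_test = [feature[i] for i in test_idx]
    coef = fit_confound(c_train, f_train)  # the test fold is never used here
    return residualize(c_train, f_train, coef), residualize(c_test, f_test, coef)

# Toy data (made up): a feature that is exactly 2 * age + 1.
age = [20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
feature = [2 * a + 1 for a in age]
train_res, test_res = foldwise_residualize(
    age, feature, train_idx=[0, 1, 2, 3], test_idx=[4, 5]
)
```

In a real analysis this step would run inside every CV iteration, once per feature, before the classifier is fit. Whole-dataset regression would instead call fit_confound once on all observations, which lets the test observations influence the coefficients.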

There are clear problems with inflation of classification accuracies when using cross-validation. This raises the question of what the models are actually learning, because it seems like it’s generally not what we think it is (brain stuff) and is instead some extraneous confound. I’ll probably make another post about what inferences we can make and what questions we can answer from ML models.

Mohan Gupta
Psychology PhD Student

My research interests include the testing effect, statistical modelling, and artificial intelligence.
