Better-Than-Chance Classification for Signal Detection
Tags: Statistics, Machine Learning
In 2012 my friend Roee Gilron told me about a popular workflow for detecting activation in the brain: fit a classifier, then use a permutation test to check if its cross-validated accuracy is better than chance level. “That can’t be right,” I said. “So much power is left on the table!” “Can you show it to me?” Roee replied. Well, I did. And 7 years later, our results have been published in Biostatistics.
Roee’s question led to a mass of simulations, which led to new questions, which led to new simulations. This question also attracted the interest of my other colleagues: Roy Mukamel, Jelle Goeman, and Yuval Benjamini.
The core of the work is a comparison of the power of two main approaches: (1) detecting signal using a supervised classifier, as described above; (2) detecting signal using a multivariate hypothesis test, such as Hotelling’s \(T^2\) test. We call the former an accuracy test, and the latter a two-group test. We studied the high-dimension-small-sample setup, where the dimension of each measurement is comparable to the number of measurements. This setup is consistent with applications in brain-imaging and genetics.
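To make the comparison concrete, here is a minimal sketch of both tests on simulated data, assuming scikit-learn; the classifier, the shrinkage estimator, and all constants are illustrative choices, not the paper’s exact setup. The two-group statistic is a Hotelling-type \(T^2 = \frac{n_1 n_2}{n} (\bar{x}_1 - \bar{x}_2)^\top \hat{\Sigma}^{-1} (\bar{x}_1 - \bar{x}_2)\); since the dimension is comparable to the sample size, \(\hat{\Sigma}\) here is a Ledoit-Wolf shrunken pooled covariance, and the p-value comes from the same label-permutation scheme as the accuracy test.

```python
import numpy as np
from sklearn.covariance import LedoitWolf
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import permutation_test_score

rng = np.random.default_rng(0)
n, p = 40, 30                      # sample size comparable to dimension
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1] += 0.3                   # a small mean shift: the "signal"

# (1) Accuracy test: permutation p-value of cross-validated accuracy.
_, _, p_accuracy = permutation_test_score(
    LinearDiscriminantAnalysis(), X, y, cv=5,
    n_permutations=999, scoring="accuracy", random_state=0)

# (2) Two-group test: a Hotelling-type statistic with a regularized
#     (Ledoit-Wolf) pooled covariance, calibrated by permutation.
def t2_stat(X, y):
    X0, X1 = X[y == 0], X[y == 1]
    d = X1.mean(axis=0) - X0.mean(axis=0)
    # Pool the within-group residuals to estimate the noise covariance.
    resid = np.vstack([X0 - X0.mean(axis=0), X1 - X1.mean(axis=0)])
    prec = LedoitWolf(assume_centered=True).fit(resid).precision_
    return len(X0) * len(X1) / len(y) * (d @ prec @ d)

obs = t2_stat(X, y)
perm = np.array([t2_stat(X, rng.permutation(y)) for _ in range(999)])
p_t2 = (1 + np.sum(perm >= obs)) / (1 + perm.size)
print(f"accuracy test p = {p_accuracy:.3f}; two-group test p = {p_t2:.3f}")
```

Both tests are valid by the permutation argument; the difference we quantify is in power.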
Here is a VERY short summary of our conclusions.
- Accuracy tests are underpowered compared to two-group tests.
- In high dimension, covariance regularization is crucial. The statistical literature has many two-group tests designed for high dimension.
- The optimal regularization for testing and the optimal regularization for prediction are different.
- The interplay between the direction of the signal and the principal components of the noise has a considerable effect on power.
- Two-group tests do not require cross-validation. They are thus considerably faster to compute.
- If insisting on accuracy tests instead of two-group tests, then resampling with replacement has more power than resampling without replacement. In particular, the leave-one-out bootstrap is better than cross-validation (see the sketch after this list).
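For those who do insist on an accuracy test, here is a minimal sketch of the leave-one-out bootstrap accuracy estimate, again assuming scikit-learn and an LDA classifier as illustrative choices. Each classifier is trained on a resample drawn with replacement, and each observation is scored only by classifiers whose training resample did not contain it:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def loo_bootstrap_accuracy(X, y, B=200, seed=0):
    """Leave-one-out bootstrap estimate of classification accuracy."""
    rng = np.random.default_rng(seed)
    n = len(y)
    hits = np.zeros(n)    # correct out-of-bag predictions per observation
    tries = np.zeros(n)   # times each observation was out-of-bag
    for _ in range(B):
        idx = rng.integers(0, n, size=n)        # resample WITH replacement
        oob = np.setdiff1d(np.arange(n), idx)   # observations not drawn
        if oob.size == 0 or np.unique(y[idx]).size < 2:
            continue  # cannot train, or nothing to score
        clf = LinearDiscriminantAnalysis().fit(X[idx], y[idx])
        hits[oob] += clf.predict(X[oob]) == y[oob]
        tries[oob] += 1
    seen = tries > 0
    return np.mean(hits[seen] / tries[seen])
```

To turn the estimate into a test, recompute it on permuted labels, exactly as in the earlier sketch, and report the permutation p-value.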
The intuition for our main findings is the following:
- Estimating accuracies adds a discretization stage (each prediction is reduced to correct/incorrect), which reduces power and is needless for testing.
- In high-dim, there is barely enough data to estimate the covariance in the original space, let alone in augmented feature spaces. Kernel tricks and deep nets may work fine in low-dim, but are hopeless in high-dim.
Given these findings, the tremendous popularity of accuracy tests is quite puzzling. We dare conjecture that it is partly due to the growing popularity of machine learning, and the reversal of the inference cascade: researchers first fit a classifier, and only then ask whether there is any difference between populations. Were researchers to start by testing for any difference between populations, and only then fit a classifier, a two-group test would be the natural starting point.
The full details can be found in [1].
[1] Jonathan D. Rosenblatt, Yuval Benjamini, Roee Gilron, Roy Mukamel, and Jelle J. Goeman. Better-than-chance classification for signal detection. Biostatistics. https://doi.org/10.1093/biostatistics/kxz035