Better-Than-Chance Classification for Signal Detection
Tags: Statistics, Machine Learning
In 2012 my friend Roee Gilron told me about a popular workflow for detecting activation in the brain: fit a classifier, then use a permutation test to check if its cross-validated accuracy is better than chance level. “That can’t be right,” I said. “So much power is left on the table!” “Can you show it to me?” Roee replied. Well, I did. And 7 years later, our results have been published in Biostatistics.
Roee’s question led to a mass of simulations, which led to new questions, which led to new simulations. This question also attracted the interest of my other colleagues: Roy Mukamel, Jelle Goeman, and Yuval Benjamini.
The core of the work is a comparison of the power of two main approaches: (1) detecting signal using a supervised classifier, as described above; (2) detecting signal using a multivariate hypothesis test, such as Hotelling’s \(T^2\) test. We call the former an accuracy test, and the latter a two-group test. We studied the high-dimension-small-sample setup, where the dimension of each measurement is comparable to the number of measurements. This setup is consistent with applications in brain-imaging and genetics.
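To make the comparison concrete, here is a minimal sketch of both tests on simulated data, assuming scikit-learn; the classifier, the shrinkage estimator, and all constants are illustrative choices, not the paper’s exact setup. The two-group statistic is a Hotelling-type \(T^2 = \frac{n_1 n_2}{n} (\bar{x}_1 - \bar{x}_2)^\top \hat{\Sigma}^{-1} (\bar{x}_1 - \bar{x}_2)\); since the dimension is comparable to the sample size, \(\hat{\Sigma}\) here is a Ledoit-Wolf shrunken pooled covariance, and the p-value comes from the same label-permutation scheme as the accuracy test.

```python
import numpy as np
from sklearn.covariance import LedoitWolf
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import permutation_test_score

rng = np.random.default_rng(0)
n, p = 40, 30                      # sample size comparable to dimension
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1] += 0.3                   # a small mean shift: the "signal"

# (1) Accuracy test: permutation p-value of cross-validated accuracy.
_, _, p_accuracy = permutation_test_score(
    LinearDiscriminantAnalysis(), X, y, cv=5,
    n_permutations=999, scoring="accuracy", random_state=0)

# (2) Two-group test: a Hotelling-type statistic with a regularized
#     (Ledoit-Wolf) pooled covariance, calibrated by permutation.
def t2_stat(X, y):
    X0, X1 = X[y == 0], X[y == 1]
    d = X1.mean(axis=0) - X0.mean(axis=0)
    # Pool the within-group residuals to estimate the noise covariance.
    resid = np.vstack([X0 - X0.mean(axis=0), X1 - X1.mean(axis=0)])
    prec = LedoitWolf(assume_centered=True).fit(resid).precision_
    return len(X0) * len(X1) / len(y) * (d @ prec @ d)

obs = t2_stat(X, y)
perm = np.array([t2_stat(X, rng.permutation(y)) for _ in range(999)])
p_t2 = (1 + np.sum(perm >= obs)) / (1 + perm.size)
print(f"accuracy test p = {p_accuracy:.3f}; two-group test p = {p_t2:.3f}")
```

Both tests are valid by the permutation argument; the difference we quantify is in power.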
Here is a VERY short summary of our conclusions.
- Accuracy tests are underpowered compared to two-group tests.
- In high dimension, covariance regularization is crucial. The statistical literature has many two-group tests designed for high dimension.
- The optimal regularization for testing and the optimal regularization for prediction are different.
- The interplay between the direction of the signal and the principal components of the noise has a considerable effect on power.
- Two-group tests do not require cross-validation. They are thus considerably faster to compute.
- If insisting on accuracy tests instead of two-group tests, then resampling with replacement has more power than resampling without replacement. In particular, the leave-one-out bootstrap is better than cross-validation (see the sketch after this list).
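For those who do insist on an accuracy test, here is a minimal sketch of the leave-one-out bootstrap accuracy estimate, again assuming scikit-learn and an LDA classifier as illustrative choices. Each classifier is trained on a resample drawn with replacement, and each observation is scored only by classifiers whose training resample did not contain it:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def loo_bootstrap_accuracy(X, y, B=200, seed=0):
    """Leave-one-out bootstrap estimate of classification accuracy."""
    rng = np.random.default_rng(seed)
    n = len(y)
    hits = np.zeros(n)    # correct out-of-bag predictions per observation
    tries = np.zeros(n)   # times each observation was out-of-bag
    for _ in range(B):
        idx = rng.integers(0, n, size=n)        # resample WITH replacement
        oob = np.setdiff1d(np.arange(n), idx)   # observations not drawn
        if oob.size == 0 or np.unique(y[idx]).size < 2:
            continue  # cannot train, or nothing to score
        clf = LinearDiscriminantAnalysis().fit(X[idx], y[idx])
        hits[oob] += clf.predict(X[oob]) == y[oob]
        tries[oob] += 1
    seen = tries > 0
    return np.mean(hits[seen] / tries[seen])
```

To turn the estimate into a test, recompute it on permuted labels, exactly as in the earlier sketch, and report the permutation p-value.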
The intuition for our main findings is the following:
- Estimating accuracies adds a discretization stage (each prediction is reduced to correct/incorrect), which reduces power and is needless for testing.
- In high-dim, there is barely enough data to estimate the covariance in the original space, let alone in augmented feature spaces. Kernel tricks and deep nets may work fine in low-dim, but are hopeless in high-dim.
Given these findings, the tremendous popularity of accuracy tests is quite puzzling. We dare conjecture that it is partly due to the growing popularity of machine learning, and the reversal of the inference cascade: researchers first fit a classifier, and only then ask whether there is any difference between populations. Were researchers to start by testing for any difference between populations, and only then fit a classifier, a two-group test would be the natural starting point.
The full details can be found in [1].
[1] Jonathan D. Rosenblatt, Yuval Benjamini, Roee Gilron, Roy Mukamel, and Jelle J. Goeman. Better-than-chance classification for signal detection. Biostatistics. https://doi.org/10.1093/biostatistics/kxz035