Don’t Peek: Deep Learning without looking … at test data

What is the purpose of a theory?  To explain why something works.  Sure.  But what good is a theory (e.g. VC theory) that is totally useless in practice?  A good theory makes predictions.

Recently we introduced the theory of Implicit Self-Regularization in Deep Neural Networks.  Most notably, we observe that in all pre-trained models, the layer weight matrices display near Universal power law behavior.  That is, we can compute their eigenvalues and fit the empirical spectral density (ESD) to a power law form:

$$\rho(\lambda)\sim\lambda^{-\alpha}$$

For a given $N\times M$ weight matrix $\mathbf{W}$ (with $N\ge M$), we form the correlation matrix

$$\mathbf{X}=\frac{1}{N}\mathbf{W}^{T}\mathbf{W}$$

and then compute the $M$ eigenvalues $\lambda_{i}$ of $\mathbf{X}$:

$$\mathbf{X}\mathbf{v}_{i}=\lambda_{i}\mathbf{v}_{i}$$

We call the histogram of these eigenvalues the Empirical Spectral Density (ESD), $\rho(\lambda)$.  It can nearly always be fit to the power law form above.  We call the Power Law Universal because 80-90% of the exponents $\alpha$ lie in the range

$$2\le\alpha\le 4$$
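To make this concrete, here is a minimal sketch of the computation in numpy, using the powerlaw package for the fit.  The random Gaussian $\mathbf{W}$ is just a stand-in; in practice $\mathbf{W}$ would come from a pretrained model.

```python
import numpy as np
import powerlaw  # pip install powerlaw

def esd_alpha(W):
    """Fit the ESD of a weight matrix W (N x M, N >= M) to a power law
    and return the exponent alpha."""
    N, M = W.shape
    assert N >= M, "expects the tall orientation: N >= M"
    X = W.T @ W / N                   # M x M correlation matrix
    evals = np.linalg.eigvalsh(X)     # the M eigenvalues of X
    fit = powerlaw.Fit(evals)         # fits rho(lambda) ~ lambda^(-alpha)
    return fit.power_law.alpha

# Stand-in example; a heavy-tailed layer from a real pretrained model
# would give alpha in the Universal range [2, 4].
W = np.random.randn(1000, 500)
print("alpha =", esd_alpha(W))
```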

For fully connected (FC) layers, we just take $\mathbf{W}$ as is.  For Conv2D layers with shape $(N, M, k, k)$, we consider all $k\times k$ 2D feature maps of shape $N\times M$.  For any large, modern, pretrained DNN, this can give a large number of eigenvalues.  The results on Conv2D layers have not yet been published except on my blog on Power Laws in Deep Learning, but they are very easy to reproduce with this notebook.
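As a sketch of what this unrolling looks like (assuming the modern torch.linalg API; PyTorch stores a Conv2D weight as an $(N, M, k, k)$ tensor):

```python
import torch
import torchvision.models as models

def conv2d_esds(conv_weight):
    """Yield the eigenvalues of each of the k x k 2D feature maps of a
    Conv2D weight tensor with shape (N, M, k, k)."""
    N, M, kh, kw = conv_weight.shape
    for i in range(kh):
        for j in range(kw):
            W = conv_weight[:, :, i, j]      # one N x M feature map
            if W.shape[0] < W.shape[1]:      # keep the tall orientation
                W = W.T
            X = W.T @ W / W.shape[0]
            yield torch.linalg.eigvalsh(X).numpy()

# Example: the first Conv2D layer of a pretrained VGG11
vgg = models.vgg11(pretrained=True)
W = vgg.features[0].weight.detach()          # shape (64, 3, 3, 3)
esds = list(conv2d_esds(W))                  # 9 ESDs, one per (i, j)
```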

As with the FC layers, we find that nearly all the ESDs can be fit to a power law, and 80-90% of the exponents lie between 2 and 4.  Compared to the FC layers, though, the Conv2D layers do show more exponents $\alpha > 4$.  We will discuss these results in detail in a future paper.  And while Universality is very theoretically interesting, a more practical question is

Are power law exponents correlated with better generalization accuracies?  … YES they are!

We can see this by looking at 2 or more versions of several pretrained models, available in pytorch, including:

  • The VGG models, with and without BatchNormalization, such as VGG11 vs VGG11_BN

  • Inception V3 vs V4

  • SqueezeNet V1.0 vs V1.1

  • The ResNext101 models

  • The sequence of Resnet models, including Resnet18, 34, 50, 101, & 152, as well as

  • 2 other ResNet implementations, CaffeResnet101 and FbResnet152

To compare these model versions, we can simply compute the average power law exponent $\bar{\alpha}$, averaged across all FC weight matrices and Conv2D feature maps.  (Note I only consider matrices with $M\ge 50$.)  In nearly every case, a smaller $\bar{\alpha}$ is correlated with better test accuracy (i.e. generalization performance).
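A sketch of how this average might be computed directly on a pytorch model, reusing the esd_alpha helper from the first sketch (the min_M cutoff implements the note above):

```python
import numpy as np
import torch.nn as nn

def avg_alpha(model, min_M=50):
    """Average power law exponent over all FC weight matrices and
    Conv2D feature maps with at least min_M eigenvalues."""
    alphas = []
    for layer in model.modules():
        if isinstance(layer, nn.Linear):
            mats = [layer.weight.detach().numpy()]
        elif isinstance(layer, nn.Conv2d):
            W = layer.weight.detach().numpy()
            N, M, kh, kw = W.shape
            mats = [W[:, :, i, j] for i in range(kh) for j in range(kw)]
        else:
            continue
        for W in mats:
            if W.shape[0] < W.shape[1]:   # keep the tall orientation
                W = W.T
            if W.shape[1] < min_M:        # skip matrices with too few eigenvalues
                continue
            alphas.append(esd_alpha(W))   # esd_alpha from the first sketch
    return np.mean(alphas)
```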

The only significant caveats are:

  1. For the VGG16 and VGG19 models, we do not include the last FC layer in the average (the layer that connects the model to the labels).  In both models, this last layer has a higher power law exponent that throws off the average for the model.

  2. The InceptionResnetV2 is an outlier.  It is unclear why at this time.  It is not shown here but will be discussed when these results are published.

Let's first look at the VGG models, plus a couple of others, not including the final FC layer in the average (again, this only changes the results for VGG16 and VGG19).

In all cases, the pre-trained model with the better Test Accuracy has, on average, smaller power law exponents, i.e. a smaller $\bar{\alpha}$.  This is an easy comparison because we are looking at 2 versions of the same architecture, with only slight improvements.  For example, VGG11_BN differs from VGG11 only in that it has Batch Normalization.
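Concretely, a comparison along these lines, reusing the avg_alpha sketch above, might look like:

```python
import torchvision.models as models

# Compare 2 versions of the same architecture; the version with the
# better reported test accuracy should have the smaller average alpha.
vgg11    = models.vgg11(pretrained=True)
vgg11_bn = models.vgg11_bn(pretrained=True)

print("VGG11    avg alpha:", avg_alpha(vgg11))
print("VGG11_BN avg alpha:", avg_alpha(vgg11_bn))
```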

The Inception models show similar behavior:  InceptionV3 has a smaller Test Accuracy than InceptionV4, and, likewise, the InceptionV3 $\bar{\alpha}$ is larger than the InceptionV4 $\bar{\alpha}$.

Now consider the Resnet models, which increase in size and have more architectural differences between them:

Across all these Resnet models, the better Test Accuracies are strongly correlated with smaller average exponents $\bar{\alpha}$.  The correlation is not perfect; the smaller Resnet50 is an outlier, and Resnet152 has a slightly larger $\bar{\alpha}$ than FbResnet152, but they are very close.  Overall, I would argue the theory works pretty well: better Test Accuracies are correlated with a smaller $\bar{\alpha}$ across a wide range of architectures.

These results are easily reproduced with this notebook.

This is an amazing result!

You can think of the power law exponent $\alpha$ as a kind of information metric: the smaller $\alpha$, the more information is in the layer weight matrix.

Suppose you are training a DNN and trying to optimize the hyper-parameters.  I believe that by looking at the power law exponents of the layer weight matrices, you can predict which variation will perform better, without peeking at the test data.
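For example, a model-selection helper along these lines might look like the following sketch; the checkpoint names, paths, and build_model factory are hypothetical:

```python
import torch

def pick_best(checkpoints, build_model):
    """Given {run_name: path_to_state_dict}, return the run whose trained
    weights have the smallest average power law exponent."""
    scores = {}
    for name, path in checkpoints.items():
        model = build_model()                 # fresh instance of the architecture
        model.load_state_dict(torch.load(path))
        scores[name] = avg_alpha(model)       # avg_alpha from the sketch above
    return min(scores, key=scores.get)        # no test data required

# Hypothetical usage, e.g. after a learning-rate sweep:
# best = pick_best({"lr=0.01": "run1.pt", "lr=0.001": "run2.pt"}, MyNet)
```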

In addition to the VGG, Inception, ResNext, SqueezeNet, and (larger) ResNet models above, even more positive results are available here on ~40 more DNNs across ~10 more architectures, including MeNet, ShuffleNet, DPN, PreResNet, DenseNet, SE-ResNet, MobileNet, MobileNetV2, and FDMobileNet.

I hope it is useful to you in training your own Deep Neural Networks.  And I hope to get feedback from you on how useful this is in practice.

 
