This is a follow-up to a previous post:
DON’T PEEK: DEEP LEARNING WITHOUT LOOKING … AT TEST DATA
The idea: suppose we want to compare 2 or more deep neural networks (DNNs). Maybe we are

- fine-tuning a DNN for transfer learning, or
- comparing a new architecture to an old one, or
- just tuning our hyper-parameters.
Can we determine which DNN will generalize best, without peeking at the test data?
Theory actually suggests that, yes, we can!
An Unsupervised Test Metric for DNNs
We just need to measure the average log norm of the layer weight matrices $\mathbf{W}_{l}$,

$$\langle\log\Vert\mathbf{W}\Vert_{F}\rangle=\frac{1}{N_{L}}\sum_{l=1}^{N_{L}}\log\Vert\mathbf{W}_{l}\Vert_{F},$$

where $\Vert\cdot\Vert_{F}$ is the Frobenius norm and $N_{L}$ is the number of layer weight matrices.
The Frobenius norm is just the square root of the sum of the squares of the matrix elements. It is easily computed in numpy with np.linalg.norm, where 'fro' is the default norm for a 2-D array.
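For example, a minimal sketch (the matrix W here is just a random stand-in for a layer weight matrix):

```python
import numpy as np

# A random matrix standing in for a DNN layer weight matrix W_l
W = np.random.randn(256, 512)

# Frobenius norm: sqrt of the sum of the squared matrix elements.
# 'fro' is the default for 2-D arrays, so np.linalg.norm(W) gives the same result.
frob_norm = np.linalg.norm(W, 'fro')

# The quantity we average over layers is the log of this norm
log_norm = np.log(frob_norm)
print(frob_norm, log_norm)
```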
It turns out that $\langle\log\Vert\mathbf{W}\Vert_{F}\rangle$ is amazingly well correlated with the test accuracy of a DNN. How do we know? We can plot $\langle\log\Vert\mathbf{W}\Vert_{F}\rangle$ against the reported test accuracy of the pretrained DNNs available in PyTorch. First, we look at the VGG models:
VGG and VGG_BN models
The plot shows the 4 VGG and VGG_BN models. Notice that we do not need the ImageNet data to compute this; we simply compute the average log Norm and plot it against the (reported Top 5) Test Accuracy. For example, the orange dots show results for the pre-trained VGG13 and VGG13_BN ImageNet models. For each pair of models, the larger the Test Accuracy, the smaller $\langle\log\Vert\mathbf{W}\Vert_{F}\rangle$. Moreover, the correlation is nearly linear across the entire class of VGG models. We see similar behavior for …
the ResNet models
Across 4 of the 5 pretrained ResNet models, which have very different sizes, a smaller $\langle\log\Vert\mathbf{W}\Vert_{F}\rangle$ generally implies a better Test Accuracy.
It is not perfect (ResNet50 is an outlier), but it works amazingly well across numerous pretrained models, both in PyTorch and elsewhere (such as the OSMR sandbox). See the Appendix for more plots. What is more, notice that
the log Norm metric is completely Unsupervised
Recall that we have not peeked at the test data, or even the labels. We simply computed $\langle\log\Vert\mathbf{W}\Vert_{F}\rangle$ for the pretrained models directly from their weight files, and then compared it to the reported test accuracy.
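To make this concrete, here is a minimal sketch of how one might compute $\langle\log\Vert\mathbf{W}\Vert_{F}\rangle$ for the pretrained torchvision VGG models. The treatment of Conv2d layers (flattening each kernel into a 2-D matrix) is a simplification of the full analysis, and newer torchvision versions use a weights= argument in place of pretrained=True.

```python
import numpy as np
import torch
import torchvision.models as models

def avg_log_norm(model):
    """Average log Frobenius norm over the layer weight matrices of a model."""
    log_norms = []
    for module in model.modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
            W = module.weight.detach().cpu().numpy()
            W = W.reshape(W.shape[0], -1)   # flatten Conv2d kernels into a 2-D matrix
            log_norms.append(np.log(np.linalg.norm(W, 'fro')))
    return float(np.mean(log_norms))

# Compare a few pretrained ImageNet models -- no test data required
for name in ['vgg11', 'vgg13', 'vgg16', 'vgg19']:
    model = getattr(models, name)(pretrained=True)   # downloads the ImageNet weights
    print(name, avg_log_norm(model))
```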
Imagine being able to fine tune a neural network without needing test data. Many times we barely have enough training data for fine tuning, and there is a huge risk of over-training. Every time you peek at the test data, you risk leaking information into the model, causing it to overtrain. It is my hope that this simple but powerful idea will help avoid this and advance the field.
Why does this work?
Applying VC Theory of Product Norms
A recent paper by Google X and MIT shows that there is A Surprising Linear Relationship [that] Predicts Test Performance in Deep Networks. The idea is to compute a VC-like, data-dependent complexity metric $\mathcal{C}$, based on the Product Norm of the layer weight matrices:

$$\mathcal{C}\sim\Vert\mathbf{W}_{1}\Vert\,\Vert\mathbf{W}_{2}\Vert\cdots\Vert\mathbf{W}_{N_{L}}\Vert=\prod_{l=1}^{N_{L}}\Vert\mathbf{W}_{l}\Vert$$
Usually we just take $\Vert\cdot\Vert$ to be the Frobenius norm (but any p-norm may do).
If we take the log of both sides, we get a sum over layers,

$$\log\mathcal{C}\sim\sum_{l=1}^{N_{L}}\log\Vert\mathbf{W}_{l}\Vert.$$
So here we just form the average log Frobenius Norm as a measure of DNN complexity, as suggested by current ML theory.
And it seems to work remarkably well in practice.
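A quick numerical sanity check of the log-of-product identity (a sketch with random matrices standing in for the layer weights):

```python
import numpy as np

# Random matrices standing in for the layer weight matrices W_l
layers = [np.random.randn(100, 100) for _ in range(5)]

norms = [np.linalg.norm(W, 'fro') for W in layers]

log_product_norm = np.log(np.prod(norms))   # log of the Product Norm
sum_of_log_norms = np.sum(np.log(norms))    # sum of the log Frobenius norms
avg_log_norm = np.mean(np.log(norms))       # the average log norm metric

# The first two agree up to floating point error;
# the metric is simply the sum divided by the number of layers.
print(log_product_norm, sum_of_log_norms, avg_log_norm)
```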
Log Norms and Power Laws
We can also understand this through our Theory of Heavy Tailed Implicit Self-Regularization in Deep Neural Networks.
The theory shows that each layer weight matrix of a (well trained) DNN resembles a random heavy tailed matrix, and we can associate with it a power law exponent $\alpha$,

$$\rho(\lambda)\sim\lambda^{-\alpha},$$

where $\rho(\lambda)$ is the empirical spectral density of the layer (defined below). The exponent $\alpha$ characterizes how well the layer weight matrix represents the correlations in the training data. Smaller $\alpha$ is better.
Smaller exponents correspond to more implicit regularization and, presumably, better generalization (if the DNN is not overtrained). This suggests that the average power law exponent would make a good overall unsupervised complexity metric for a DNN, and this is exactly what the last blog post showed.
The average power law metric $\hat{\alpha}$ is a weighted average,

$$\hat{\alpha}=\frac{1}{N_{L}}\sum_{l=1}^{N_{L}}b_{l}\,\alpha_{l},$$

where the layer weight factor $b_{l}$ should depend on the scale of $\mathbf{W}_{l}$. In other words, 'larger' weight matrices (in some sense) should contribute more to the weighted average.
Smaller $\hat{\alpha}$ usually implies better generalization
For heavy tailed matrices, we can work out a relation between the log Norm of $\mathbf{W}$ and the power law exponent $\alpha$,

$$\log\Vert\mathbf{W}\Vert^{2}_{F}\approx\alpha\log\lambda_{max},$$

where we note that $\lambda_{max}$ is the largest eigenvalue of the correlation matrix of $\mathbf{W}$ (defined below).
So the weight factor is simply the log of the maximum eigenvalue associated with $\mathbf{W}$,

$$b=\log\lambda_{max}.$$
In the paper we will show the math; below, we present numerical results to convince the reader.
This also explains why Spectral Norm Regularization Improves the Generalizability of Deep Learning: a smaller $\lambda_{max}$ gives a smaller power law contribution and, also, a smaller log Norm. We can now relate these 2 complexity metrics:

$$\hat{\alpha}=\frac{1}{N_{L}}\sum_{l}\alpha_{l}\log\lambda_{max,l}\approx\frac{1}{N_{L}}\sum_{l}\log\Vert\mathbf{W}_{l}\Vert^{2}_{F}=2\,\langle\log\Vert\mathbf{W}\Vert_{F}\rangle$$

We argue here that we can approximate the average Power Law metric by simply computing the average log Norm of the DNN layer weight matrices. And using this, we can actually predict the trends in generalization accuracy, without needing a test data set!
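To make the connection concrete, here is a minimal sketch of the average Power Law metric $\hat{\alpha}$ for a PyTorch model. It assumes the powerlaw package (Alstott et al.) for the per-layer fits; the layer handling (flattening Conv2d kernels) and the simple normalization of the correlation matrix are simplifications for illustration.

```python
import numpy as np
import powerlaw   # pip install powerlaw
import torch

def layer_esd(module):
    """Eigenvalues of the (uncentered, normalized) correlation matrix of one layer."""
    W = module.weight.detach().cpu().numpy()
    W = W.reshape(W.shape[0], -1)   # flatten Conv2d kernels into a 2-D matrix
    if W.shape[0] < W.shape[1]:
        W = W.T                     # make W tall, so X = W^T W / N is the smaller matrix
    N = W.shape[0]
    X = W.T @ W / N
    return np.linalg.eigvalsh(X)

def weighted_alpha(model):
    """Average Power Law metric: (1/N_L) * sum_l b_l * alpha_l, with b_l = log(lambda_max)."""
    terms = []
    for m in model.modules():
        if isinstance(m, (torch.nn.Linear, torch.nn.Conv2d)):
            evals = layer_esd(m)
            alpha = powerlaw.Fit(evals).power_law.alpha   # fitted power law exponent of the ESD
            b = np.log(np.max(evals))                     # weight factor: log of the max eigenvalue
            terms.append(b * alpha)
    return float(np.mean(terms))
```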
Discussion
Implications: Norms vs Power Laws
The Power Law metric is consistent with these recent theoretical results, but our approach and intent are different:

- Unlike their result, our approach does not require modifying the loss function.
- Moreover, they seek a Worst Case complexity bound, whereas we seek Average Case metrics. Incredibly, the 2 approaches are completely compatible.
But the biggest difference is that we apply our Unsupervised metric to large, production-quality DNNs; more results are shown in the appendix below.
We believe this result will have broad applications in hyper-parameter tuning and in fine-tuning DNNs. Because we do not need to peek at the test data, it may prevent information from leaking from the test set into the model, thereby helping to prevent overtraining and making fine-tuned DNNs more robust.
WeightWatcher
We have built a python package for Jupyter Notebooks that does this for you: WeightWatcher. It works with both Keras and PyTorch, and we will release it shortly.
More Results
We use the OSMR Sandbox to compute the average log Norm for a wide variety of DNN models, using PyTorch, and compare it to the reported Top 1 Errors. This notebook reproduces the results.
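As an illustration, here is a minimal sketch of pulling pretrained models from the OSMR sandbox, assuming its pytorchcv package; the model names are an illustrative subset, and the helper is a compact version of the avg_log_norm sketch above.

```python
import numpy as np
import torch
from pytorchcv.model_provider import get_model as ptcv_get_model

def avg_log_norm(model):
    # average log Frobenius norm over Conv2d / Linear layer weight matrices
    logs = []
    for m in model.modules():
        if isinstance(m, (torch.nn.Linear, torch.nn.Conv2d)):
            W = m.weight.detach().cpu().numpy().reshape(m.weight.shape[0], -1)
            logs.append(np.log(np.linalg.norm(W, 'fro')))
    return float(np.mean(logs))

# Model names as used in the OSMR sandbox (pytorchcv); an illustrative subset
for name in ['resnet18', 'resnet34', 'resnet50', 'densenet121']:
    net = ptcv_get_model(name, pretrained=True)   # downloads pretrained ImageNet weights
    print(name, avg_log_norm(net))
```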
All the ResNet Models
DenseNet
SqueezeNet
DPN
SeResNet
Numerical Test of Log Norm Power Law Relations
In the plot below, we generate a number of heavy tailed random matrices and fit their ESDs to a power law. Then we compare the ratio $2\log\Vert\mathbf{W}\Vert_{F}/\log\lambda_{max}$ to the fitted power law exponent $\alpha$.
The code for this is sketched below:
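This is a minimal sketch, not the exact script used for the plot: it assumes the powerlaw package for the fits, and the Pareto tail parameters and matrix sizes are illustrative choices.

```python
import numpy as np
import powerlaw   # pip install powerlaw

def heavy_tailed_matrix(N, M, mu):
    """Random N x M matrix with heavy tailed (Pareto, tail index mu) entries and random signs."""
    signs = np.sign(np.random.randn(N, M))
    return signs * np.random.pareto(mu, size=(N, M))

N, M = 1000, 500
print(" mu    alpha   2 log||W||_F / log(lambda_max)")
for mu in np.linspace(1.0, 4.0, 13):
    W = heavy_tailed_matrix(N, M, mu)

    # ESD: eigenvalues of the correlation matrix X = W^T W
    # (the 1/N normalization used in the text only shifts the logs by a constant)
    X = W.T @ W
    evals = np.linalg.eigvalsh(X)

    # fit the tail of the ESD to a power law rho(lambda) ~ lambda^(-alpha)
    alpha = powerlaw.Fit(evals).power_law.alpha

    # compare the fitted exponent to the log norm ratio
    ratio = 2.0 * np.log(np.linalg.norm(W, 'fro')) / np.log(np.max(evals))
    print(f"{mu:4.2f}   {alpha:5.2f}   {ratio:5.2f}")
```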
Below are results for a variety of heavy tailed random matrices:
The plot shows the relation between the ratios $2\log\Vert\mathbf{W}\Vert_{F}/\log\lambda_{max}$ and the empirical power law exponents $\alpha$. There are three striking features: the linear relation is very clear in the heavy tailed regime (small $\alpha$); it saturates for larger $\alpha$; and, for large M, it extends beyond the regime where the simple derivation strictly applies, because of finite size effects.
In our next paper, we will drill into these details and explain further how this relation arises and the implications for Why Deep Learning Works. Please stay tuned! And please subscribe if this is useful to you.
Extra: Proof of the Log Norm Power Law Relation
For dessert, we now show that the log Frobenius norm squared is simply the weighted power law exponent we need,

$$\log\Vert\mathbf{W}\Vert^{2}_{F}\approx b\,\alpha,$$

and that the weight factor $b$ for our average Power Law metric is the log of the maximum eigenvalue of the weight matrix (more precisely, of its correlation matrix),

$$b=\log\lambda_{max}.$$
Spectral Theory
We first need the Empirical Spectral Density (ESD), defined in the last blog post. We construct the (uncentered, normalized) correlation matrix of $\mathbf{W}$,

$$\mathbf{X}=\frac{1}{N}\mathbf{W}^{T}\mathbf{W},$$

and compute its eigenvalues $\lambda_{i}$,

$$\mathbf{X}\mathbf{v}_{i}=\lambda_{i}\mathbf{v}_{i}.$$
We can define the Frobenius norm using the Trace,

$$\Vert\mathbf{W}\Vert^{2}_{F}=\text{Tr}\left[\mathbf{W}^{T}\mathbf{W}\right]=N\,\text{Tr}\left[\mathbf{X}\right].$$

Spectral theory tells us that the Trace is invariant to a change of basis, and is equivalent to the sum of the eigenvalues,

$$\text{Tr}\left[\mathbf{X}\right]=\sum_{i}\lambda_{i}.$$

So if we know the eigenvalues of $\mathbf{X}$, we know the norm (the factor of N only shifts the log by a constant).
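A quick numerical check of this identity (a minimal sketch; the matrix size is arbitrary):

```python
import numpy as np

N, M = 1000, 500
W = np.random.randn(N, M)

X = W.T @ W / N                      # (uncentered, normalized) correlation matrix
evals = np.linalg.eigvalsh(X)

frob_sq = np.linalg.norm(W, 'fro')**2
print(frob_sq, N * evals.sum())      # the two agree up to floating point error
```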
As in the theory of Heavy Tailed Implicit Regularization, we compute the Empirical Spectral Density (ESD) of $\mathbf{X}$. That is, we make a histogram of the eigenvalues and fit it to a continuous density. For most DNN weight matrices, we can reasonably fit the ESD to a power law,

$$\rho(\lambda)\sim\lambda^{-\alpha},$$

and this will be a pretty good approximation over the finite range $[\lambda_{min},\lambda_{max}]$.
In fact, roughly 90% of all DNN layer weight matrices fit a power law with a well defined exponent $\alpha$.
This lets us replace the sum in the Trace with an integral over this eigenvalue density, giving the Frobenius norm (up to constant factors) as

$$\Vert\mathbf{W}\Vert^{2}_{F}\;\propto\;\int_{\lambda_{min}}^{\lambda_{max}}\lambda\,\rho(\lambda)\,d\lambda.$$

To evaluate this integral, we just need a little high school calculus,

$$\int\lambda^{1-\alpha}\,d\lambda=\frac{\lambda^{2-\alpha}}{2-\alpha},$$

where we have made a simplifying assumption on the range of $\alpha$ and used a simple Power Law normalization for $\rho(\lambda)$. Of course, this simple normalization is a bit off, and I will address this in more detail in the paper.
That's ok for now. Taking the log of both sides, and ignoring the near-constant terms, we easily get the simple linear relation we expected,

$$\log\Vert\mathbf{W}\Vert^{2}_{F}\approx\alpha\log\lambda_{max},$$

which works pretty well for large heavy tailed random matrices. And this relation, when plotted, looks like our plot above, just flipped.