Two Recent Results in Transfer Learning for Music and Speech

In this post, I want to highlight two recent complementary results on transfer learning applied to audio — one related to music, another related to speech.

The first paper, on the music side, is by Keunwoo Choi and friends, and it incidentally won the best paper award at ISMIR 2017. But that’s not the only reason to read it. You should read it because it’s extremely well written, and its experiments ask and answer many interesting questions about transfer learning for music.

In transfer learning, you transfer representations learned on a source domain/task to a target domain/task. The setup in the Choi et al. paper is very straightforward: the source task is fixed to music tagging, with the label space limited to the top-50 tags. They demonstrate that the learned representations perform well on a bunch of related music tasks and also on an urban sound detection task.
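Concretely, this is the feature-extraction flavor of transfer learning: freeze the source-task network, use it to embed target-task audio, and fit a simple classifier on top. Here is a minimal sketch of that pipeline (the `pretrained_tagger` callable and the SVM choice are illustrative assumptions, not the paper’s exact setup):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: `pretrained_tagger(clip)` returns a fixed-length feature vector
# for one audio clip, computed by a network trained on the source task
# (top-50 music tagging) and kept frozen here.
def extract_features(clips, pretrained_tagger):
    return np.stack([pretrained_tagger(clip) for clip in clips])

def transfer(train_clips, train_labels, test_clips, pretrained_tagger):
    # Use the frozen source-task network purely as a feature extractor,
    # then train a lightweight classifier on the target task.
    X_train = extract_features(train_clips, pretrained_tagger)
    X_test = extract_features(test_clips, pretrained_tagger)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X_train, train_labels)
    return clf.predict(X_test)
```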

What’s really interesting about the Choi et al. paper is that, unlike the traditional approach of taking just the penultimate layer of the network, they consider all possible combinations of the intermediate layers, including a concatenation of all of them. This sounds crazy! And it is, because for n layers there are 2^n - 1 non-empty combinations to evaluate. I did not think this was even feasible, but Choi and friends cleverly chose a ConvNet small enough to make these experiments tractable.
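To make the combinatorics concrete, here is a minimal sketch of the idea (not the authors’ code; `layer_features` and the pooling are placeholder assumptions): enumerate every non-empty subset of per-layer feature vectors and concatenate each subset into a single representation.

```python
from itertools import combinations
import numpy as np

def layer_combinations(layer_features):
    """Yield (subset, concatenated feature vector) for every non-empty subset
    of intermediate-layer features. `layer_features` is a list of 1-D arrays,
    e.g. the pooled activations of each conv layer for one clip."""
    n = len(layer_features)
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            yield subset, np.concatenate([layer_features[i] for i in subset])

# With, say, n = 5 layers this yields 2**5 - 1 = 31 candidate representations,
# each of which gets its own downstream evaluation in the style of Choi et al.
```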


The results are quite interesting: concatenating all the layers almost always wins over other combinations, including just the penultimate layer. I think it’s pretty neat.

The second paper is about speech, and it is from us (shameless plug). We ask a complementary question: can we effectively transfer representations across audio domains? If ConvNets are learning hierarchical features of sounds in the audible spectrum, can we use any audio data to learn representations that make sense for speech? To test this, we transferred representations from an environmental sounds dataset (UrbanSounds) to the newly released Google Speech Commands data. In addition, we use a much deeper DenseNet-121 model and a new multiscale representation built from dilated convolutions: we apply four convolutional kernels with incrementally larger dilation factors and stack their outputs in a single layer. Our observation is that the multiscale representation works really well with the pre-trained representation.
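Here is a minimal sketch of what such a multiscale block might look like, written in PyTorch (the framework choice, channel counts, and the specific dilation factors are illustrative assumptions, not the exact configuration from our paper):

```python
import torch
import torch.nn as nn

class MultiScaleDilatedBlock(nn.Module):
    """Four parallel convolutions with increasing dilation factors, stacked
    (concatenated along the channel axis) into a single layer."""

    def __init__(self, in_channels, out_channels_per_branch=16,
                 kernel_size=3, dilations=(1, 2, 4, 8)):  # dilations are illustrative
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels_per_branch, kernel_size,
                      padding=d * (kernel_size // 2), dilation=d)
            for d in dilations
        ])

    def forward(self, x):
        # Each branch sees a different receptive field; concatenating them gives
        # downstream layers a multiscale view of the spectrogram.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# Example: a batch of 8 single-channel log-mel spectrograms (96 bins, 101 frames).
features = MultiScaleDilatedBlock(in_channels=1)(torch.randn(8, 1, 96, 101))
print(features.shape)  # torch.Size([8, 64, 96, 101])
```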

Now the natural question emerges: if representations learnt from environmental sounds do well on speech, does the reverse direction work? Although we don’t report it in our paper, the reverse direction doesn’t transfer well, despite the Google Speech Commands dataset being an order of magnitude larger than UrbanSounds. While we don’t have a solid understanding of why this happens, our intuition is that the gamut of environmental sounds effectively covers the manifold of acoustic events (speech being one of them), whereas speech alone covers only a small part of that manifold.

I think these two papers are barely scratching the surface of transfer learning for speech and audio. For example, does the Choi et al. style of concatenating all the intermediate layers work well with deeper architectures like the ResNet and DenseNet models we explored in our paper? What kinds of audio data transfer well to speech tasks, and why? Do MFCC features matter (our paper doesn’t even bother to use them, but Choi et al. do)?

There is so much to be discovered here!