Up to now I’ve mostly analysed metadata about music, and when I have looked at the track content I’ve focused on the lyrics. Now I want to look at analysing the sound itself. In this post I will demonstrate how to extract some useful information from an audio file using Python.
Starting with a basic question: how do I convert music to data? For analogue sound this is impractical; digital music, however, is effectively data already. Sound is just pressure waves, and these waves can be represented by numbers over a time period. A .WAV file stores the audio wave as a sequence of numbers, and an MP3 file is a compressed version of that .WAV.
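To make this concrete before getting to any real music, here is a small sketch that builds one second of a 440 Hz sine wave as an array of numbers and saves it as a WAV (purely illustrative; the filename and tone are just examples):

import numpy as np
import scipy.io.wavfile

#one second of a 440 Hz tone, sampled 44100 times per second
sample_rate = 44100
t = np.arange(sample_rate) / float(sample_rate)
tone = (0.5 * np.iinfo(np.int16).max * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
scipy.io.wavfile.write("sine_440hz.wav", sample_rate, tone)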
I began with a sample of the track Inspiration Information by Shuggie Otis, provided by Spotify. I downloaded this MP3 file, uncompressed it to a WAV, then read the WAV file in as a data array.
#required libraries
import urllib
import scipy.io.wavfile
import pydub

#a temp folder for downloads
temp_folder="/Users/home/Desktop/"

#spotify mp3 sample file
web_file="http://p.scdn.co/mp3-preview/35b4ce45af06203992a86fa729d17b1c1f93cac5"

#download file
urllib.urlretrieve(web_file,temp_folder+"file.mp3")
#read mp3 file
mp3 = pydub.AudioSegment.from_mp3(temp_folder+"file.mp3")
#convert to wav
mp3.export(temp_folder+"file.wav", format="wav")
#read wav file
rate,audData=scipy.io.wavfile.read(temp_folder+"file.wav")

print(rate)
print(audData)
The output from wavfile.read is the sampling rate of the track and the audio wave data. The sampling rate represents the number of data points sampled per second in the audio file. In this case, 44100 pieces of information per second make up the audio wave. This is a very common rate; the higher the rate, the better the audio quality.
If the number of data points in the audio wave is divided by the rate, we get the length of the track in seconds. In this case it is 30s.

#wav length
audData.shape[0] / rate
Looking at the shape of the audio data, it has two columns of equal length. It is a stereo recording, so there is separate data for the left and right channels.
#wav number of channels mono/stereo
audData.shape[1]
#if stereo grab both channels
channel1=audData[:,0] #left
channel2=audData[:,1] #right
The data is stored as int16. This is the bit depth: the number of bits used to store each data point. Common bit depths are 8, 16 and 32. Again, the higher this is, the better the audio quality.
audData.dtype
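To put the int16 in context, numpy can report the range of values a 16-bit sample can take (a quick illustrative check, not part of the processing itself):

import numpy as np

#range of a 16-bit signed sample
print(np.iinfo(np.int16).min) # -32768
print(np.iinfo(np.int16).max) # 32767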
In the same way you can read a file, you can also save the data back to a WAV file. This means it is possible to manipulate the sound data then save it.
#save wav file
scipy.io.wavfile.write(temp_folder+"file2.wav", rate, audData)
#save a file at half and double speed (separate filenames so each version is kept)
scipy.io.wavfile.write(temp_folder+"file2_half_speed.wav", rate/2, audData)
scipy.io.wavfile.write(temp_folder+"file2_double_speed.wav", rate*2, audData)
#save a single channel
scipy.io.wavfile.write(temp_folder+"file2_channel1.wav", rate, channel1)
I also tried creating a mono version by averaging the data in the left and right channels. This works up to a point, but it does seem to damage the audio.
import numpy as np

#averaging the channels damages the music
mono=np.sum(audData.astype(float), axis=1)/2
scipy.io.wavfile.write(temp_folder+"file2_mono.wav", rate, mono)
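My suspicion (an assumption on my part, not something I have confirmed) is that part of the damage comes from the data type: the averaged array is float64, so scipy writes a 64-bit float WAV rather than the original 16-bit format. Casting the averaged samples back to int16 before saving may give a cleaner result:

#cast the averaged samples back to 16-bit integers before writing
#(the filename here is just an example)
mono_int16 = (np.sum(audData.astype(float), axis=1)/2).astype(np.int16)
scipy.io.wavfile.write(temp_folder+"file2_mono_int16.wav", rate, mono_int16)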
The values in the data represent the amplitude of the wave (or the loudness of the audio). The energy of the audio can be described by the sum of the squared amplitudes.
#Energy of music
np.sum(channel1.astype(float)**2)
This will depend on the length of the audio, the sample rate and the volume of the audio. A better metric is the power, which is the energy per unit of time.
#power - energy per unit of time
1.0/(2*(channel1.size)+1)*np.sum(channel1.astype(float)**2)/rate
Next I wanted to plot my track. I plotted the amplitude over time for each channel.
import matplotlib.pyplot as plt
#create a time variable in seconds
time = np.arange(0, float(audData.shape[0]), 1) / rate
#plot amplitude (or loudness) over time
plt.figure(1)
plt.subplot(211)
plt.plot(time, channel1, linewidth=0.01, alpha=0.7, color='#ff7f00')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.subplot(212)
plt.plot(time, channel2, linewidth=0.01, alpha=0.7, color='#ff7f00')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.show()
The next thing to look at is the frequency of the audio. In order to do this you need to decompose the single audio wave into audio waves at different frequencies. This can be done using a Fourier transform. However, the last time I thought about Fourier transforms was at university, so I thought I had better brush up. I went through the first few weeks of this free signal processing course on Coursera, and it was a great help.
The Fourier transform effectively iterates through a frequency for as many frequencies as there are records (N) in the dataset, and determines the amplitude of that frequency. The frequency of record k (fk) can be calculated from the sampling rate (fs) as fk = k * fs / N.
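To make that description concrete, here is a minimal, naive sketch of the calculation. This is not how numpy computes it (numpy uses the much faster FFT algorithm), and it is only practical for short signals, but it shows the idea of stepping through each frequency and measuring its amplitude:

import numpy as np

def naive_dft(x):
    #for each frequency index k, multiply the signal by a complex sinusoid
    #at that frequency and sum; the result is the (complex) amplitude of fk
    N = len(x)
    n = np.arange(N)
    result = np.zeros(N, dtype=complex)
    for k in range(N):
        result[k] = np.sum(x * np.exp(-2j * np.pi * k * n / N))
    return result

#the frequency of bin k in Hz is k * rate / N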
The following code performs the Fourier transformation on the left channel sound and plots it. The maths produces a symmetrical result: the second half of the output is a mirror image (the complex conjugate) of the first half, so only the first half carries new information.
from numpy import fft as fft
fourier=fft.fft(channel1)
plt.plot(fourier, color='#ff7f00')
plt.xlabel('k')
plt.ylabel('Amplitude')
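As a quick check on that symmetry (just illustrative, using the fourier array computed above), bin k and bin N-k should be complex conjugates of each other for a real-valued signal:

#the second half of the spectrum mirrors the conjugate of the first half
print(np.allclose(fourier[1:], np.conj(fourier[1:][::-1])))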
Because of that symmetry we only need the first half of the result, so we can grab it, calculate the frequency at each point, and plot the frequency against a scaled amplitude.
n = len(channel1)
fourier = fourier[0:(n//2)]
#scale by the number of points so that the magnitude does not depend on the length
fourier = fourier / float(n)
#calculate the frequency at each point in Hz
freqArray = np.arange(0, (n//2), 1.0) * (rate*1.0/n)
plt.plot(freqArray/1000, 10*np.log10(np.abs(fourier)), color='#ff7f00', linewidth=0.02)
plt.xlabel('Frequency (kHz)')
plt.ylabel('Power (dB)')
Another common way to analyse audio is to create a spectrogram. Audio spectrograms are heat maps that show the frequencies of the sound in Hertz (Hz) and the volume of the sound in decibels (dB) against time.
In order to calculate a Fourier transform over time, the specgram function used below uses a time-windowed Fast Fourier transform. This simplifies the calculation involved and makes it possible to do in seconds. It calculates many Fourier transforms over blocks of data 'NFFT' samples long. Each Fourier transform over a block gives the frequencies represented in that block and their magnitudes, so along the time axis the resulting array is roughly NFFT times smaller than the original data. The range of frequencies covered is always half the sample rate; the number of samples in the block (NFFT) determines how many frequencies within that range are resolved. So a bigger block gives finer frequency resolution, but reduces the information with respect to time.
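As a rough illustration of that trade-off, assuming the 44100 Hz rate from above and the NFFT of 1024 used below:

#frequency and time resolution implied by the block size
NFFT = 1024
freq_resolution = rate / float(NFFT)  #~43 Hz between frequency bins
time_resolution = NFFT / float(rate)  #~23 ms per block (ignoring any overlap)
print(freq_resolution, time_resolution)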
plt.figure(2, figsize=(8,6))
plt.subplot(211)
Pxx, freqs, bins, im = plt.specgram(channel1, Fs=rate, NFFT=1024, cmap=plt.get_cmap('autumn_r'))
cbar=plt.colorbar(im)
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
cbar.set_label('Intensity (dB)')
plt.subplot(212)
Pxx, freqs, bins, im = plt.specgram(channel2, Fs=rate, NFFT=1024, cmap=plt.get_cmap('autumn_r'))
cbar=plt.colorbar(im)
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
cbar.set_label('Intensity (dB)')
plt.show()
The result allows us to pick out a certain frequency and examine it:
#find the index of the frequency bin closest to 10kHz (returns 233)
np.where(freqs==10034.47265625)
#grab the intensity of that frequency bin over time and plot it
MHZ10=Pxx[233,:]
plt.plot(bins, MHZ10, color='#ff7f00')
So that's the basics of audio processing. I'm now looking forward to analysing my favourite music, and I'm sure there will be posts on that to come.
As always, all of the above code can be found together in the following gist.