I am trying to obtain single vector feature representations for audio files to use in a machine learning task (specifically, classification using a neural net). I have experience in computer vision and natural language processing, but I need some help getting up to speed with audio files.
There are a variety of feature descriptors for audio files out there, but it seems that MFCCs are used the most for audio classification tasks. My question is this: how do I take the MFCC representation for an audio file, which is usually a matrix (of coefficients, presumably), and turn it into a single feature vector? I am currently using librosa for this.
I have a bunch of audio files, but they all vary in their shape:
import os
import librosa

for filename in os.listdir('data'):
    y, sr = librosa.load('data/' + filename)
    print(filename, librosa.feature.mfcc(y=y, sr=sr).shape)
213493.ogg (20, 2375)
120093.ogg (20, 7506)
174576.ogg (20, 2482)
194439.ogg (20, 14)
107936.ogg (20, 2259)
What I would do as a CV person is quantize these coefficients by doing k-means and then use something like scipy.cluster.vq to get vectors of identical shape that I can use as input to my NN. Is this what you would do in the audio case as well, or are there different/better approaches to this problem?
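Concretely, this is the kind of bag-of-frames pipeline I have in mind, as a sketch (the codebook size k = 32 is an arbitrary choice, and bag_of_frames is just my name for the helper):

import os
import numpy as np
import librosa
from scipy.cluster.vq import kmeans2, vq

k = 32  # arbitrary codebook size

# Gather MFCC frames (one row per frame) from every file
mfccs = {}
for filename in os.listdir('data'):
    y, sr = librosa.load('data/' + filename)
    mfccs[filename] = librosa.feature.mfcc(y=y, sr=sr).T  # (n_frames, 20)

# Learn a codebook over all frames pooled together
codebook, _ = kmeans2(np.vstack(list(mfccs.values())), k, minit='points')

def bag_of_frames(frames):
    # Assign each frame to its nearest codeword, then histogram the counts
    codes, _ = vq(frames, codebook)
    hist, _ = np.histogram(codes, bins=np.arange(k + 1))
    return hist / float(hist.sum())

# Every file now maps to a fixed-length k-dimensional vector
vectors = {f: bag_of_frames(m) for f, m in mfccs.items()}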
From the librosa docs for feature.mfcc: if multi-channel audio input y is provided, the MFCC calculation will depend on the peak loudness (in decibels) across all channels.
To get the MFCC features, all we need to do is call librosa's feature.mfcc and give it the audio data and the corresponding sample rate of the audio signal.
librosa's spectral features include: computing a chromagram from a waveform or power spectrogram; computing a mel-scaled spectrogram; computing the root-mean-square (RMS) value for each frame, either from the audio samples y or from a spectrogram S; and computing the spectral centroid.
The MFCC feature extraction technique consists of windowing the signal, applying the DFT, taking the log of the magnitude, warping the frequencies onto a mel scale, and finally applying the inverse DCT.
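Recent librosa versions expose each of those pieces, so you can rebuild feature.mfcc by hand as a sanity check; note that librosa implements the final step as a type-II DCT. A sketch, using one of the files from the question (n_mels=128 and 20 coefficients are just librosa's defaults):

import numpy as np
import scipy.fftpack
import librosa

y, sr = librosa.load('data/213493.ogg')

# Windowed DFT + mel warping: a mel-scaled power spectrogram
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)

# Log of the magnitude, then a DCT along the mel axis
mfcc_manual = scipy.fftpack.dct(librosa.power_to_db(S), axis=0, type=2, norm='ortho')[:20]

# Should agree with librosa's built-in implementation
print(np.allclose(mfcc_manual, librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)))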
It really depends on the task. I would try k-means, etc., but there are plenty of cases where that might not be helpful.
There are a few good examples of using dynamic time warping with librosa.
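For instance, newer librosa releases ship DTW as librosa.sequence.dtw; a minimal sketch comparing two of your files (assuming a librosa version that includes the sequence module):

import librosa

y1, sr1 = librosa.load('data/213493.ogg')
y2, sr2 = librosa.load('data/174576.ogg')

X = librosa.feature.mfcc(y=y1, sr=sr1)
Y = librosa.feature.mfcc(y=y2, sr=sr2)

# Accumulated cost matrix D and optimal warping path wp
D, wp = librosa.sequence.dtw(X=X, Y=Y, metric='euclidean')

# The bottom-right entry is the total alignment cost: a length-agnostic distance
print(D[-1, -1])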
There's also the idea of using sliding windows of a known shape, which might work well too. Then you could consider the previous prediction together with a transition probability matrix.
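A sketch of that windowing idea (mfcc_windows is a hypothetical helper; the 100-frame width and 50-frame hop are arbitrary numbers):

import numpy as np

def mfcc_windows(mfcc, width=100, hop=50):
    # Chop a (n_mfcc, n_frames) matrix into flattened, fixed-size windows
    n_mfcc, n_frames = mfcc.shape
    starts = range(0, n_frames - width + 1, hop)
    return np.array([mfcc[:, s:s + width].ravel() for s in starts])

windows = mfcc_windows(np.random.randn(20, 2375))
print(windows.shape)  # (46, 2000): every row can feed a fixed-input NN

Per-window predictions can then be smoothed with the transition probability matrix mentioned above.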
Check out scikits.talkbox. It has various functions that help you compute MFCCs from audio files. Specifically, you would want to do something like this:
import numpy as np
import scipy.io.wavfile
from scikits.talkbox.features import mfcc

sample_rate, X = scipy.io.wavfile.read("path/to/audio_file")
# ceps: cepstral coefficients per frame; mspec: log mel spectrum; spec: spectrum
ceps, mspec, spec = mfcc(X)
np.save("cache_file_name", ceps)  # cache results so that ML becomes fast (np.save appends .npy)
Then while doing ML, do something like:
X = []
ceps = np.load("cache_file_name.npy")
num_ceps = len(ceps)
# Average over the middle 80% of frames to skip leading/trailing silence and edge effects
X.append(np.mean(ceps[int(num_ceps / 10):int(num_ceps * 9 / 10)], axis=0))
Vx = np.array(X)
# use Vx as the input values vector for a neural net, k-means, etc.
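The same mean-pooling carries over to the librosa output from the question; just note that librosa puts frames along axis 1, whereas talkbox's ceps has frames along axis 0. A quick sketch:

import numpy as np
import librosa

y, sr = librosa.load('data/213493.ogg')
m = librosa.feature.mfcc(y=y, sr=sr)  # shape (20, n_frames)
# Middle-80% mean over frames yields a fixed-length 20-dim vector
vec = m[:, m.shape[1] // 10 : m.shape[1] * 9 // 10].mean(axis=1)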
I used this stuff when I was building an audio genre classification tool (genreXpose).
PS: One handy tool for audio conversion that I used was PyDub.