I'm looking for some advice on Dynamic Time Warping (DTW).
I have a Python script that extracts Mel-Frequency Cepstral Coefficient (MFCC) feature vectors from .WAV files of various lengths. Each file is represented by an array of feature vectors; the number of vectors varies with the length of the file, and each vector contains 12 MFCCs.
For example, one .WAV file may be represented by an array of 10 feature vectors of 12 MFCCs each, whilst another .WAV file may be represented by an array of 20 feature vectors of 12 MFCCs each.
I intend to use DTW to compare the two arrays of arrays, but I'm unsure how. I understand the concept of DTW and would have no issue implementing it if the elements of each array were single numbers. My confusion is due to the fact that they are arrays.
Tl;dr: How would one compare two arrays of arrays using DTW?
Edit: I have read this question to no avail.
Many thanks, Adam
Extracting features from the audio signal and using them as input to the base model generally gives much better performance than feeding in the raw audio signal directly. MFCC is a widely used technique for extracting such features from an audio signal.
The MFCC feature extraction technique basically includes windowing the signal, applying the DFT, warping the resulting power spectrum onto the Mel scale with a bank of filters, taking the log of the filter energies, and then applying the DCT.
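To make those steps concrete, here is a rough single-frame sketch in plain Python, following the order most implementations use (Mel filterbank, then log, then DCT). It is not the code from the question: the Hamming window, the 26 triangular filters, the naive DFT, and dropping coefficient 0 to leave 12 values are all assumptions chosen for illustration, and `mfcc_frame` is a made-up name. A real script would normally rely on an audio library instead.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sample_rate, n_filters=26, n_coeffs=12):
    """Return n_coeffs MFCCs for one frame (illustrative, not optimised)."""
    n = len(frame)
    # 1. Window the frame (Hamming window is an assumption).
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    # 2. Naive DFT, keeping the power of the non-negative frequency bins.
    n_bins = n // 2 + 1
    power = []
    for k in range(n_bins):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        power.append(abs(acc) ** 2 / n)
    # 3. Triangular filters whose edges are equally spaced on the Mel scale.
    top_mel = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top_mel * i / (n_filters + 1))
             for i in range(n_filters + 2)]
    energies = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        e = 0.0
        for k in range(n_bins):
            f = k * sample_rate / n
            if lo < f <= mid:
                e += power[k] * (f - lo) / (mid - lo)
            elif mid < f < hi:
                e += power[k] * (hi - f) / (hi - mid)
        energies.append(e)
    # 4. Log of the filter energies (small offset avoids log(0)).
    log_e = [math.log(e + 1e-10) for e in energies]
    # 5. DCT-II of the log energies; skip coefficient 0, keep 1..n_coeffs.
    return [sum(log_e[m] * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m in range(n_filters))
            for c in range(1, n_coeffs + 1)]
```

Calling `mfcc_frame` on each successive frame of a .WAV file would give the array of 12-element vectors described in the question.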
The mel frequency cepstral coefficients (MFCCs) of a signal are a small set of features (usually about 10-20) which concisely describe the overall shape of a spectral envelope. In music information retrieval (MIR), they are often used to describe timbre.
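In speech recognition, an MFCC-based feature vector often has 39 features: 12 cepstral coefficients plus an energy term, together with the first and second time derivatives (deltas and delta-deltas) of those 13 values.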
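As a rough illustration of how a 39-dimensional vector is usually assembled, the sketch below stacks 13 static values per frame (assumed here to be 12 MFCCs plus an energy term) with their deltas and delta-deltas. The regression formula and the window of two frames on each side are common choices but are assumptions, and `deltas` and `stack_with_deltas` are hypothetical helper names.

```python
def deltas(frames, n=2):
    """Regression deltas over +/- n frames; edge frames are repeated."""
    num_frames = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, n + 1))
    out = []
    for t in range(num_frames):
        row = []
        for c in range(dim):
            acc = 0.0
            for k in range(1, n + 1):
                ahead = frames[min(t + k, num_frames - 1)][c]
                behind = frames[max(t - k, 0)][c]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out

def stack_with_deltas(static):
    """Append deltas and delta-deltas to every static feature vector."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```

With 13 static values per frame, every stacked vector has 13 * 3 = 39 entries, and the number of frames is unchanged.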
There is a nice tutorial on DTW here.
I have done this in a dozen papers; see the zebra finch example here.
A key thing to note: you probably want to compare just ONE feature vector to the corresponding feature vector. It is rare that it is useful to use all 12.
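For what it's worth, here is a minimal sketch of both options in plain Python. DTW only needs a distance between two elements of the sequences, so when each element is itself an array of 12 MFCCs, one option is to use the Euclidean distance between the two arrays as the local cost (that choice of distance is an assumption; others would work too). The second function reads the advice above as "warp on a single coefficient": it pulls coefficient `k` out of every frame and runs ordinary one-dimensional DTW. The function names are made up for this example, and there is no warping window or path recovery.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(seq_a, seq_b, dist=frame_distance):
    """Classic O(n*m) DTW; returns the accumulated cost of the best path."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i items with the first j.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # step in seq_a only
                                 cost[i][j - 1],      # step in seq_b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]

def dtw_single_coefficient(seq_a, seq_b, k):
    """DTW on coefficient k only, i.e. on two plain 1-D series."""
    series_a = [frame[k] for frame in seq_a]
    series_b = [frame[k] for frame in seq_b]
    return dtw(series_a, series_b, dist=lambda x, y: abs(x - y))
```

The two sequences may have different lengths (10 frames against 20, say); only the length of each individual frame has to match. For example, `dtw([[0.0, 0.0]], [[3.0, 4.0]])` gives `5.0`, and a sequence compared with a time-stretched copy of itself gives `0.0`.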