For my final-year project I am trying to identify dog-bark and bird sounds in real time (by recording sound clips). I am using MFCCs as the audio features. Initially I extracted a total of 12 MFCC values from a sound clip using the jAudio library. Now I am trying to train a machine learning algorithm (I have not decided on the algorithm yet, but it will most probably be an SVM). The sound clips are around 3 seconds long. I need to clarify some things about this process. They are:
Do I have to train this algorithm using frame-based MFCCs (12 per frame) or overall clip-based MFCCs (12 per sound clip)?
To train the algorithm, do I have to treat the 12 MFCCs as 12 different attributes, or as one single attribute?
These are the overall MFCCs for the clip:
-9.598802712290967 -21.644963856237265 -7.405551798816725 -11.638107212413201 -19.441831623156144 -2.780967392843105 -0.5792847321137902 -13.14237288849559 -4.920408873192934 -2.7111507999281925 -7.336670942457227 2.4687330348335212
Any help to overcome these problems would be really appreciated. I couldn't find good help on Google. :)
Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). An MFCC extractor returns coefficients over time: it splits the audio into short windows and computes one MFCC feature vector per window. It is one of the most widely used feature representations for audio signals.
You should calculate MFCCs per frame. Since your signal varies over time, taking them over the whole clip would not make sense; worse, you might end up with a dog and a bird having similar representations. I'd experiment with several frame lengths. In general they will be on the order of tens of milliseconds.
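A minimal sketch of frame-based extraction, assuming Python with librosa (you are using jAudio, so this is purely illustrative; the file name and window parameters are placeholders to experiment with):

```python
import librosa

# Load a ~3 s clip; librosa resamples to 22050 Hz by default.
y, sr = librosa.load("bark.wav")  # hypothetical file

# 12 MFCCs per frame: ~23 ms windows (512 samples) with 50% overlap.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                            n_fft=512, hop_length=256)

print(mfcc.shape)  # (12, n_frames) -- one 12-dim vector per frame
```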
All of them should be separate features. Let the machine learning algorithm decide which ones are the best predictors.
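For instance, with scikit-learn (an assumption, since you haven't fixed a toolkit), each of the 12 coefficients becomes its own column in the design matrix:

```python
import numpy as np
from sklearn.svm import SVC

# Suppose `mfcc` is the (12, n_frames) array from the previous sketch
# and the whole clip is labelled "dog".
X_frames = mfcc.T                         # (n_frames, 12): 12 attributes per row
y_frames = np.full(len(X_frames), "dog")  # each frame inherits the clip label

# Alternative: collapse the clip into a single row of summary statistics
# (mean and std of each coefficient -> 24 attributes per clip).
x_clip = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Stack rows from many labelled clips into X_train / y_train, then:
clf = SVC(kernel="rbf")
# clf.fit(X_train, y_train)
```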
Mind that MFCCs are sensitive to noise, so do check first how your samples sound. A far richer selection of audio features is offered by e.g. the Yaafe library, and many of them will serve you better here. Which ones specifically? For what I found most useful in classifying bird calls, you might want to check out this project, especially the part where I interface with Yaafe.
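A rough sketch of what that interfacing looks like with Yaafe's Python bindings (the sample rate, block sizes, and the extra feature chosen here are assumptions; check the Yaafe docs for the exact specification strings):

```python
from yaafelib import FeaturePlan, Engine, AudioFileProcessor

# Declare the features to extract via Yaafe specification strings.
fp = FeaturePlan(sample_rate=16000)
fp.addFeature('mfcc: MFCC blockSize=512 stepSize=256')
fp.addFeature('flatness: SpectralFlatness blockSize=512 stepSize=256')

engine = Engine()
engine.load(fp.getDataFlow())

# Run the plan over a clip and collect the per-frame outputs.
afp = AudioFileProcessor()
afp.processFile(engine, 'clip.wav')   # hypothetical file
features = engine.readAllOutputs()    # dict: feature name -> (frames x dims)
print(features['mfcc'].shape)
```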
Back in the day I used SVMs, exactly as you are planning. Today I would definitely go with gradient boosting.
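For the gradient-boosting route, a sketch with scikit-learn (again my assumption for the toolkit; libraries like XGBoost or LightGBM are common choices too):

```python
from sklearn.ensemble import GradientBoostingClassifier

# X_train: one row of MFCC-derived features per clip (or per frame),
# y_train: the matching class labels ("dog", "bird", ...).
clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1)
# clf.fit(X_train, y_train)
# print(clf.score(X_test, y_test))
```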