
MFCC feature descriptors for audio classification using librosa

I am trying to obtain single vector feature representations for audio files to use in a machine learning task (specifically, classification using a neural net). I have experience in computer vision and natural language processing, but I need some help getting up to speed with audio files.

There are a variety of feature descriptors for audio files out there, but it seems that MFCCs are used the most for audio classification tasks. My question is this: how do I take the MFCC representation for an audio file, which is usually a matrix (of coefficients, presumably), and turn it into a single feature vector? I am currently using librosa for this.

I have a bunch of audio files, but they all vary in their shape:

import os
import librosa

for filename in os.listdir('data'):
    y, sr = librosa.load(os.path.join('data', filename))
    print(filename, librosa.feature.mfcc(y=y, sr=sr).shape)

213493.ogg (20, 2375)
120093.ogg (20, 7506)
174576.ogg (20, 2482)
194439.ogg (20, 14)
107936.ogg (20, 2259)

What I would do as a CV person is quantize these coefficients with k-means and then use something like scipy.cluster.vq to get vectors of identical shape that I can use as input to my NN. Is this what you would do in the audio case as well, or are there different/better approaches to this problem?
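For reference, the CV-style quantization described above could be sketched like this (numpy only; the random codebook stands in for centroids you would actually get by running k-means over MFCC frames from all files, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
K = 32  # codebook size (illustrative choice)

# stand-in codebook: in practice, fit k-means on MFCC frames pooled
# across the whole training set and use its centroids here
codebook = rng.normal(size=(K, 20))

def bag_of_features(mfcc):
    """Quantize each frame to its nearest codeword; return a normalized histogram."""
    frames = mfcc.T                                    # (n_frames, 20)
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                      # nearest codeword per frame
    hist = np.bincount(labels, minlength=K).astype(float)
    return hist / hist.sum()

# files with very different frame counts map to identical-shape vectors
v1 = bag_of_features(rng.normal(size=(20, 2375)))
v2 = bag_of_features(rng.normal(size=(20, 14)))
assert v1.shape == v2.shape == (K,)
```

This is exactly the "bag of visual words" recipe transplanted to audio: the histogram length depends only on the codebook size, not on the clip duration.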

asked Sep 23 '14 by Doa



2 Answers

It really depends on the task. I would try k-means etc., but there are a lot of cases where that might not be helpful.

There are a few good examples of using dynamic time warping with librosa.

There's also the idea of using a sliding window of a known shape, which might work well too. You could then consider the previous prediction together with a transition probability matrix.
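The sliding-window idea might be sketched like this (numpy only; the window width and hop are arbitrary illustrative choices, not values from the answer):

```python
import numpy as np

def mfcc_windows(mfcc, width=100, hop=50):
    """Cut a (n_mfcc, n_frames) matrix into fixed-shape windows along time."""
    n_frames = mfcc.shape[1]
    starts = range(0, n_frames - width + 1, hop)
    return np.stack([mfcc[:, s:s + width] for s in starts])

rng = np.random.default_rng(0)
wins = mfcc_windows(rng.normal(size=(20, 2375)))  # fake MFCC matrix
# every window has the same (20, 100) shape, so each can be classified
# independently; predictions can then be smoothed with a transition
# probability matrix as the answer suggests
assert wins.shape[1:] == (20, 100)
```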

answered Oct 17 '22 by Shane Walker


Check out scikits.talkbox. It has various functions that help you generate MFCCs from audio files. Specifically, you'd do something like this to generate MFCCs:

import numpy as np
import scipy.io.wavfile
from scikits.talkbox.features import mfcc

sample_rate, X = scipy.io.wavfile.read("path/to/audio_file")
ceps, mspec, spec = mfcc(X)
np.save("cache_file_name", ceps) # cache results so that ML becomes fast

Then while doing ML, do something like:

X = []
ceps = np.load("cache_file_name.npy")  # np.save appends the .npy extension
num_ceps = len(ceps)
# average the middle 80% of frames to get one fixed-length vector per file
X.append(np.mean(ceps[int(num_ceps / 10):int(num_ceps * 9 / 10)], axis=0))
Vx = np.array(X)
# use Vx as the input matrix for a neural net, k-means, etc.
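The trim-and-average pooling above can be isolated into a small numpy-only sketch (synthetic arrays stand in for talkbox's `(n_frames, 13)` cepstral output; the function name is illustrative):

```python
import numpy as np

def pool_mfcc(ceps):
    """Average the middle 80% of frames into one fixed-length vector,
    discarding the first and last 10% (often noisy intro/outro frames)."""
    n = len(ceps)
    return np.mean(ceps[n // 10 : n * 9 // 10], axis=0)

rng = np.random.default_rng(0)
# two "files" with very different frame counts, 13 coefficients each
a = pool_mfcc(rng.normal(size=(2375, 13)))
b = pool_mfcc(rng.normal(size=(14, 13)))
assert a.shape == b.shape == (13,)  # same shape regardless of clip length
```

Mean pooling throws away temporal order, which is fine for tasks like genre classification but a poor fit when timing matters.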

I used this stuff when I was building an audio genre classification tool (genreXpose).

PS: One handy tool for audio conversion that I used was PyDub

answered Oct 17 '22 by jazdev