Using the Librosa library, I generated MFCC features for a 1319-second audio file, which gave me a 20 x 56829 matrix. The 20 is the number of MFCC coefficients (which I can adjust manually), but I don't know how the audio was segmented into 56829 frames. What frame size does it use to process the audio?
import numpy as np
import matplotlib.pyplot as plt
import librosa


def getPathToGroundtruth(episode):
    """Return path to groundtruth file for episode"""
    pathToGroundtruth = "../../../season01/Audio/" \
        + "Season01.Episode%02d.en.wav" % episode
    return pathToGroundtruth


def getduration(episode):
    """Return the duration of the episode's audio in seconds"""
    pathToAudioFile = getPathToGroundtruth(episode)
    y, sr = librosa.load(pathToAudioFile)
    duration = librosa.get_duration(y=y, sr=sr)
    return duration


def getMFCC(episode):
    """Return the MFCC matrix (n_mfcc x frames) for the episode"""
    filename = getPathToGroundtruth(episode)
    y, sr = librosa.load(filename)  # y: audio time series, sr: sampling rate
    data = librosa.feature.mfcc(y=y, sr=sr)
    return data


data = getMFCC(1)
Mel-frequency cepstral coefficients (MFCCs). Warning: if multi-channel audio input y is provided, the MFCC calculation will depend on the peak loudness (in decibels) across all channels. The result may differ from independent MFCC calculation of each channel.
The default frame and hop lengths are set to 2048 and 512 samples, respectively.
Librosa is a Python package for music and audio analysis. It is typically used when working with audio data, for example in music generation (using LSTMs) or automatic speech recognition, and it provides the building blocks needed to create music information retrieval systems.
Short Answer
You can change the number of frames by changing the parameters used in the STFT calculation. For example, halving the hop length roughly doubles the number of columns in your output (to about 20 x 113658):
data = librosa.feature.mfcc(y=y, sr=sr, n_fft=1024, hop_length=256, n_mfcc=20)
Long Answer
Librosa's librosa.feature.mfcc() function builds on librosa.feature.melspectrogram() (which is itself a wrapper around the librosa.core.stft and librosa.filters.mel functions). It then converts the Mel spectrogram to decibels and applies a discrete cosine transform to obtain the coefficients.
All of the parameters pertaining to segmentation of the audio signal, namely the frame and overlap values, are used by the Mel-scaled power spectrogram function (with other tunable parameters specified for the nested core functions). You specify these parameters as keyword arguments to the librosa.feature.mfcc() function.
All extra **kwargs parameters are passed to librosa.feature.melspectrogram() and subsequently to librosa.filters.mel().
By default, the Mel-scaled power spectrogram window and hop length are:
n_fft=2048
hop_length=512
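To relate these defaults to the original question about frame size, the sample counts can be converted to durations. This is a minimal sketch, assuming the default sample rate of 22050 Hz; the helper name frame_timing_ms is hypothetical and not part of librosa.

```python
def frame_timing_ms(sr=22050, n_fft=2048, hop_length=512):
    """Return (window length, hop length) in milliseconds."""
    window_ms = 1000.0 * n_fft / sr
    hop_ms = 1000.0 * hop_length / sr
    return window_ms, hop_ms


window_ms, hop_ms = frame_timing_ms()
print(f"window: {window_ms:.1f} ms, hop: {hop_ms:.1f} ms")
# window: 92.9 ms, hop: 23.2 ms
```

With these values each frame covers about 93 ms of audio, and consecutive frames start about 23 ms apart, so neighbouring frames overlap by 75%.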
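So, assuming you used the default sample rate (sr=22050), the output of your mfcc function makes sense:
output length = (seconds) * (sample rate) / (hop_length)
(1319) * (22050) / (512) ≈ 56804 frames
The small gap from the 56829 frames you observed is consistent with the file being slightly longer than 1319 seconds (about 1319.5 s), plus the one extra frame that librosa's default centered framing adds.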
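The estimate above divides the sample count by the hop length. As a rough sketch of where the frames come from, assume librosa's default centered framing (center=True), where the signal is padded by n_fft // 2 samples on each side before the window slides across it. Under that assumption the frame count is 1 + n_samples // hop_length. Both helper names below are hypothetical, and the second one only counts window positions rather than computing any spectra.

```python
def expected_num_frames(n_samples, hop_length=512):
    """Closed-form frame count for centered framing."""
    return 1 + n_samples // hop_length


def count_frames_by_framing(n_samples, n_fft=2048, hop_length=512):
    """Count frames by sliding a window over the padded signal length."""
    # Centered framing pads n_fft // 2 samples on each side.
    padded_len = n_samples + 2 * (n_fft // 2)
    count = 0
    start = 0
    while start + n_fft <= padded_len:
        count += 1
        start += hop_length
    return count


n_samples = 1319 * 22050  # exactly 1319 s at the default sample rate
print(expected_num_frames(n_samples))                  # 56805
print(count_frames_by_framing(n_samples))              # 56805
print(expected_num_frames(n_samples, hop_length=256))  # 113610
```

Note that n_fft does not affect the count here; only hop_length does, which is why halving the hop length roughly doubles the number of columns.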
The parameters you can tune are the following:
Melspectrogram Parameters
-------------------------
y : np.ndarray [shape=(n,)] or None
    audio time-series
sr : number > 0 [scalar]
    sampling rate of `y`
S : np.ndarray [shape=(d, t)]
    power spectrogram
n_fft : int > 0 [scalar]
    length of the FFT window
hop_length : int > 0 [scalar]
    number of samples between successive frames.
    See `librosa.core.stft`
kwargs : additional keyword arguments
    Mel filter bank parameters.
    See `librosa.filters.mel` for details.
If you want to further specify the characteristics of the mel filterbank used to define the Mel-scaled power spectrogram, you can tune the following:
Mel Frequency Parameters
------------------------
sr : number > 0 [scalar]
    sampling rate of the incoming signal
n_fft : int > 0 [scalar]
    number of FFT components
n_mels : int > 0 [scalar]
    number of Mel bands to generate
fmin : float >= 0 [scalar]
    lowest frequency (in Hz)
fmax : float >= 0 [scalar]
    highest frequency (in Hz).
    If `None`, use `fmax = sr / 2.0`
htk : bool [scalar]
    use HTK formula instead of Slaney
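To give a feel for what n_mels, fmin, fmax and htk control, here is a small sketch of how Mel band centre frequencies can be laid out using the HTK formula, mel = 2595 * log10(1 + f / 700). This illustrates the HTK variant only, not librosa's default Slaney scale, and the function names are hypothetical rather than librosa's API. The n_mels=128 and fmax=11025 values are assumptions chosen to match a 22050 Hz sample rate.

```python
import math


def hz_to_mel_htk(freq_hz):
    """Convert Hz to mels using the HTK formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers_hz(n_mels=128, fmin=0.0, fmax=11025.0):
    """Centre frequencies of n_mels bands, evenly spaced on the mel scale."""
    lo, hi = hz_to_mel_htk(fmin), hz_to_mel_htk(fmax)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz_htk(lo + step * (i + 1)) for i in range(n_mels)]


centers = mel_band_centers_hz(n_mels=10)
print([round(c) for c in centers])
```

The bands are spaced evenly in mels, so in Hz they are packed closely at low frequencies and spread out at high frequencies. Raising fmin or lowering fmax narrows the range the bands cover.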
Documentation for Librosa:
librosa.feature.melspectrogram
librosa.filters.mel
librosa.core.stft