I'm trying to train an LSTM model for speech recognition but don't know what training data and target data to use. I'm using the LibriSpeech dataset, which contains both audio files and their transcripts. At this point, I know the target data will be the vectorized transcript text. As for the training data, I was thinking of using the frequency and time information from each audio file (or MFCC features). If that is the correct way to approach the problem, the training data/audio will be multiple arrays; how would I input those arrays into my LSTM model? Will I have to vectorize them?
Thanks!
The benefit of deep LSTM-RNNs over conventional LSTM-RNNs is that they use their parameters more efficiently by distributing them across multiple layers. Deep LSTM-RNNs have given good results in large-vocabulary speech recognition tasks [15], [31].
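For concreteness, here is a minimal sketch of such a stacked (deep) LSTM in Keras. The layer sizes, the 13 MFCC input coefficients, and the 29-character output alphabet are illustrative assumptions, not values from the cited papers:

```python
# A minimal stacked-LSTM sketch in Keras; all sizes are illustrative.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, TimeDistributed

model = Sequential([
    # return_sequences=True passes the full frame-by-frame sequence on
    # to the next layer, which is what makes the network "deep"
    LSTM(128, return_sequences=True, input_shape=(None, 13)),  # (time, 13 MFCCs)
    LSTM(128, return_sequences=True),
    # One character distribution per audio frame,
    # e.g. 26 letters + space + apostrophe + blank
    TimeDistributed(Dense(29, activation="softmax")),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```

The `input_shape=(None, 13)` leaves the time dimension unspecified, so the same model accepts utterances of different lengths.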
Two popular sets of features often used in the analysis of the speech signal are the Mel-frequency cepstral coefficients (MFCC) and the linear prediction cepstral coefficients (LPCC). The most popular recognition models are vector quantization (VQ), dynamic time warping (DTW), and artificial neural networks (ANN) [3].
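As an example, here is a hedged sketch of extracting MFCC frames with the python_speech_features package (the library pinned later in this answer). The file path is a placeholder, and note that LibriSpeech ships FLAC files, which you would convert to WAV first:

```python
# MFCC extraction sketch with python_speech_features; "audio.wav" is a
# placeholder (LibriSpeech is distributed as 16 kHz FLAC; convert first).
import scipy.io.wavfile as wav
from python_speech_features import mfcc

rate, signal = wav.read("audio.wav")
features = mfcc(signal, samplerate=rate, numcep=13)
print(features.shape)  # (num_frames, 13): one 13-dim vector per ~25 ms frame
```

Each audio file thus becomes a 2D array of shape (frames, coefficients), and different files produce different numbers of frames.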
So if you want to build your own speech recognition service and you have enough data, why go with those hosted services? You can train your own model. Luckily, there is an open-source model available, based on Baidu's Deep Speech research paper and referred to as Mozilla DeepSpeech.
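If you go that route, a minimal usage sketch with the deepspeech Python package might look like the following. The model and scorer filenames follow the released 0.9.3 artifacts, but treat the exact paths as placeholders:

```python
# Sketch of transcribing a clip with Mozilla DeepSpeech (pip install deepspeech).
# Audio must be 16-bit, 16 kHz, mono PCM; filenames are placeholders.
import wave
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.9.3-models.pbmm")
ds.enableExternalScorer("deepspeech-0.9.3-models.scorer")  # optional language model

with wave.open("audio.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))  # plain-text transcription
```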
These models take in audio and directly output transcriptions. Two of the most popular end-to-end models today are Deep Speech by Baidu and Listen, Attend and Spell (LAS) by Google. Both Deep Speech and LAS are recurrent neural network (RNN) based architectures with different approaches to modeling speech recognition.
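Deep Speech trains with the CTC objective, which lets the network emit a character distribution per audio frame without needing frame-level alignments between audio and text. Here is a hedged sketch of computing a CTC loss with Keras' built-in `ctc_batch_cost`; every shape and size in it is illustrative:

```python
# CTC loss sketch with Keras; batch size, sequence lengths, and the
# 29-symbol vocabulary (characters + CTC blank) are all made up.
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

batch, time_steps, vocab = 2, 50, 29
y_pred = tf.nn.softmax(tf.random.uniform((batch, time_steps, vocab)))  # per-frame char probabilities
y_true = np.random.randint(1, vocab - 1, size=(batch, 10))             # integer-encoded transcripts
input_length = np.full((batch, 1), time_steps)                         # frames per utterance
label_length = np.full((batch, 1), 10)                                 # characters per transcript

loss = K.ctc_batch_cost(y_true, y_pred, input_length, label_length)
print(loss.shape)  # one loss value per utterance: (2, 1)
```

This is the piece that connects the variable-length MFCC input to the variable-length transcript target.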
This strategy is especially helpful when data is scarce or when your model is overfitting. For speech recognition, you can use the standard augmentation techniques, like changing the pitch or speed, injecting noise, and adding reverb to your audio data. We found spectrogram augmentation (SpecAugment) to be a much simpler and more effective approach, as sketched below.
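Here is a minimal NumPy sketch of the two core SpecAugment operations, frequency masking and time masking, applied to a spectrogram. The mask widths are made-up hyperparameters:

```python
# SpecAugment-style masking sketch: zero out a random band of frequencies
# and a random span of time frames; mask widths are illustrative.
import numpy as np

def spec_augment(spec, max_freq_mask=8, max_time_mask=20):
    spec = spec.copy()
    num_freqs, num_frames = spec.shape

    f = np.random.randint(0, max_freq_mask)   # frequency mask width
    f0 = np.random.randint(0, num_freqs - f)  # where the band starts
    spec[f0:f0 + f, :] = 0

    t = np.random.randint(0, max_time_mask)   # time mask width
    t0 = np.random.randint(0, num_frames - t) # where the span starts
    spec[:, t0:t0 + t] = 0
    return spec

augmented = spec_augment(np.random.rand(80, 300))  # e.g. 80 mel bins x 300 frames
```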
LSTM models are used for temporal dependencies, where the previous output is also an input at the current timestep. You will need to install the following library: python-speech-features==0.6. The sketch below shows one way to batch the resulting feature arrays.
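To address the "multiple arrays" part of the question directly: one common approach is to keep each utterance as a (time, features) array rather than flattening it into a single vector, and pad the arrays to a common length so a batch becomes one (batch, time, features) tensor. A sketch, assuming Keras' `pad_sequences`, with all shapes made up:

```python
# Batching variable-length MFCC arrays for an LSTM; utterance lengths
# (120, 95, 140 frames) and the 13 coefficients are illustrative.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pretend MFCC output for three utterances of different durations
utterances = [np.random.rand(n, 13) for n in (120, 95, 140)]

# Pad along the time axis so every utterance becomes (140, 13);
# the LSTM then sees a single (batch, time, features) tensor
batch = pad_sequences(utterances, dtype="float32", padding="post")
print(batch.shape)  # (3, 140, 13)
```

You can add a `Masking` layer (or pass sample weights) so the padded frames do not contribute to the loss.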
To prepare the speech dataset for feeding into the LSTM model, see this post - Building Speech Dataset for LSTM binary classification - and in particular its Data Preparation section.
As a good example, see this post - http://danielhnyk.cz/predicting-sequences-vectors-keras-using-rnn-lstm/ - which covers how to predict a sequence of vectors in Keras using an RNN/LSTM.
I believe you will find this post (https://stats.stackexchange.com/questions/192014/how-to-implement-a-lstm-based-classifier-to-classify-speech-files-using-keras) very helpful too.