 

How to train an LSTM for speech recognition

I'm trying to train an LSTM model for speech recognition, but I don't know what training data and target data to use. I'm using the LibriSpeech dataset, which contains both audio files and their transcripts. At this point, I know the target data will be the vectorized transcript text. As for the training data, I was thinking of using the frequency and time information from each audio file (or MFCC features). If that is the correct way to approach the problem, the training data/audio will be multiple arrays; how would I input those arrays into my LSTM model? Will I have to vectorize them?

Thanks!

asked Nov 25 '16 by JorgeC

People also ask

Why is LSTM good for speech recognition?

The benefit of deep LSTM-RNNs over conventional LSTM-RNNs is that they make better use of their parameters by distributing them across multiple layers. Deep LSTM-RNNs have given good results on large-vocabulary speech recognition tasks [15], [31].
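For illustration, here is a minimal sketch of a stacked ("deep") LSTM in Keras. The 13-dimensional input (e.g. MFCC frames), the layer sizes, and the number of output classes are assumptions made for the example, not values from the cited papers.

```python
# A minimal sketch of a deep (stacked) LSTM in Keras. The 13-dim input
# (e.g. MFCC frames) and all sizes below are illustrative assumptions.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    # return_sequences=True passes the full output sequence to the next
    # LSTM layer, distributing parameters across depth as well as time.
    LSTM(128, return_sequences=True, input_shape=(None, 13)),
    LSTM(128),                        # second layer consumes the sequence
    Dense(10, activation='softmax'),  # hypothetical number of classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()
```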

Which algorithm is best for speech recognition?

Two popular sets of features often used in the analysis of the speech signal are the Mel-frequency cepstral coefficients (MFCC) and the linear prediction cepstral coefficients (LPCC). The most popular recognition models are vector quantization (VQ), dynamic time warping (DTW), and artificial neural networks (ANN) [3].
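As a concrete example, here is a hedged sketch of MFCC extraction with the python_speech_features library (named further down this page). The file name is a placeholder; LibriSpeech audio is distributed as FLAC, so you would convert it to WAV first or read it with another library.

```python
# A minimal sketch of MFCC extraction with python_speech_features.
# 'utterance.wav' is a placeholder; LibriSpeech ships FLAC, so convert
# the audio to 16 kHz WAV first (or read it with e.g. soundfile).
import scipy.io.wavfile as wav
from python_speech_features import mfcc

sample_rate, signal = wav.read('utterance.wav')
# 25 ms windows with a 10 ms hop and 13 cepstral coefficients are the
# library defaults, spelled out here for clarity.
features = mfcc(signal, samplerate=sample_rate,
                winlen=0.025, winstep=0.01, numcep=13)
print(features.shape)  # (num_frames, 13): one feature vector per frame
```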

Can I train my own speech recognition model?

So if you want to build your own speech recognition service and you have enough data, why go with these services? You can train your own model. Luckily, there is an open-source model available that is based on Baidu's Deep Speech research paper and is referred to as Mozilla DeepSpeech.

What is the best machine learning model for speech recognition?

These models take in audio and directly output transcriptions. Two of the most popular end-to-end models today are Deep Speech by Baidu and Listen Attend Spell (LAS) by Google. Both Deep Speech and LAS are recurrent neural network (RNN) based architectures with different approaches to modeling speech recognition.

What is the best strategy for augmentation in speech recognition?

This strategy is especially helpful when data is scarce or when your model is overfitting. For speech recognition, you can apply the standard augmentation techniques, like changing the pitch or speed, injecting noise, and adding reverb to your audio data. We found Spectrogram Augmentation (SpecAugment) to be a much simpler and more effective approach.
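As a rough illustration, here is a minimal NumPy sketch of SpecAugment-style time and frequency masking. It assumes `spec` is a (time, frequency) spectrogram array; the mask widths are arbitrary example values, not tuned settings.

```python
# A minimal NumPy sketch of SpecAugment-style masking, assuming `spec`
# is a (time, freq) spectrogram. Mask widths are illustrative, not tuned.
import numpy as np

def spec_augment(spec, max_time_mask=20, max_freq_mask=8, rng=None):
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    t_len, f_len = spec.shape

    # Zero out a random block of consecutive time frames.
    t = rng.integers(0, max_time_mask + 1)
    t0 = rng.integers(0, max(1, t_len - t))
    spec[t0:t0 + t, :] = 0.0

    # Zero out a random block of consecutive frequency bins.
    f = rng.integers(0, max_freq_mask + 1)
    f0 = rng.integers(0, max(1, f_len - f))
    spec[:, f0:f0 + f] = 0.0
    return spec
```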

What is the use of the LSTM model?

LSTM models are used to capture temporal dependencies: the output from the previous time step is fed back as an input at the current time step. You will need to install the following library: python-speech-features==0.6.
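To connect this to the original question about feeding multiple MFCC arrays into an LSTM, here is a hedged sketch: pad the variable-length utterances to a common length and mask the padding. The random arrays stand in for real per-utterance MFCC features, and the layer sizes and class count are assumptions.

```python
# A minimal sketch of batching variable-length MFCC arrays for an LSTM.
# The random arrays stand in for real (num_frames, 13) feature matrices;
# the layer sizes and 10 output classes are illustrative assumptions.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense

mfcc_list = [np.random.randn(np.random.randint(80, 120), 13)
             for _ in range(4)]          # four fake utterances

# Zero-pad every utterance to the length of the longest one.
batch = pad_sequences(mfcc_list, dtype='float32', padding='post')

model = Sequential([
    Masking(mask_value=0.0, input_shape=(None, 13)),  # skip padded frames
    LSTM(64),
    Dense(10, activation='softmax'),
])
print(model(batch).shape)  # (4, 10): one class distribution per utterance
```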


1 Answer

To prepare the speech dataset for feeding into the LSTM model, you can see this post - Building Speech Dataset for LSTM binary classification - and in particular its Data Preparation section.

As a good example, you can see this post - http://danielhnyk.cz/predicting-sequences-vectors-keras-using-rnn-lstm/. It explains how to predict a sequence of vectors in Keras using an RNN (LSTM).
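In the same spirit as that post, here is a minimal hedged sketch of predicting a sequence of vectors with a Keras LSTM; the 13-dimensional vectors and the layer size are assumptions for illustration.

```python
# A minimal sketch of sequence-of-vectors prediction with a Keras LSTM,
# in the spirit of the post linked above. All sizes are assumptions.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, TimeDistributed

model = Sequential([
    # Input: sequences of 13-dim vectors; output: one 13-dim vector
    # per time step, via a Dense layer applied at every step.
    LSTM(64, return_sequences=True, input_shape=(None, 13)),
    TimeDistributed(Dense(13)),
])
model.compile(optimizer='adam', loss='mse')
```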

I believe you will find this post (https://stats.stackexchange.com/questions/192014/how-to-implement-a-lstm-based-classifier-to-classify-speech-files-using-keras) very helpful too.

answered Oct 26 '22 by Wasi Ahmad