I have not been successful in training an RNN for a speech-to-text problem using TensorFlow. I decided to use the raw spectrogram (pure FFT features) as training data, to reproduce the results of the method described in Graves and Jaitly (2014), and coded a 3-layer bidirectional RNN with 300 LSTM units in each layer. I would like to describe the steps I have followed, from pre-processing the audio signal to decoding the logits.
Pre-Processing:
I used the specgram function from matplotlib.mlab to segment each audio signal (in the time domain) into frames of 20 ms (NFFT = fs/1000 * 20 samples) and to perform the windowing and FFT with an overlap of 7 ms.
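For concreteness, here is a minimal sketch of that framing/FFT step, assuming 16 kHz mono audio (the variable names are illustrative, and `signal` here is a random placeholder standing in for a real waveform):

```python
import numpy as np
from matplotlib import mlab

fs = 16000                          # assumed sampling rate
signal = np.random.randn(fs)        # placeholder for one second of audio
nfft = int(fs / 1000 * 20)          # 20 ms frames -> 320 samples at 16 kHz
noverlap = int(fs / 1000 * 7)       # 7 ms overlap -> 112 samples

# specgram returns (spectrum, freqs, t); spectrum has shape [nfft/2 + 1, n_frames]
spectrum, freqs, t = mlab.specgram(signal, NFFT=nfft, Fs=fs, noverlap=noverlap)
features = spectrum.T               # one row of nfft/2 + 1 values per 20 ms frame
```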
I initially tried computing the power spectrum ps = |fft|^2 and converting it to dB via 10 * log10(ps), but the TensorFlow CTC loss function produced nan values, and the optimizer then apparently updated all the parameters to nan, so I did not proceed further with this.
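Continuing the sketch above, the dB variant I tried looks roughly like this; adding a small epsilon before the log is one common way to avoid log10(0) on silent bins, which would otherwise produce -inf/nan features (the epsilon value is just illustrative):

```python
# With the default mode='psd', specgram already returns power-like values,
# so only the dB conversion is applied here.
eps = 1e-10
db = 10.0 * np.log10(spectrum + eps)   # [nfft/2 + 1, n_frames] in dB
```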
To mention, the spectrogram is not normalised, since normalising it only makes TensorFlow produce nan values for some reason. Could someone please clarify why this is happening? I have a feeling the gradients are vanishing. Any recommendations on what initialiser range to use?
Since different audio files are of varying length, I have padded the frames of each batch up to max_time, as this is required to form a mini-batch of shape [max_time, batch, NFFT].
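As an illustration, here is a sketch of how such a batch could be padded (`feature_list` holds one [time_i, n_features] array per utterance; the helper and names are just illustrative):

```python
import numpy as np

def pad_batch(feature_list):
    """Pad variable-length utterances into a [max_time, batch, n_features] array."""
    max_time = max(f.shape[0] for f in feature_list)
    n_features = feature_list[0].shape[1]
    batch = np.zeros((max_time, len(feature_list), n_features), dtype=np.float32)
    seq_lens = np.array([f.shape[0] for f in feature_list], dtype=np.int32)
    for i, f in enumerate(feature_list):
        batch[:f.shape[0], i, :] = f          # zero-padding after each utterance
    return batch, seq_lens
```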
Since all the target transcriptions are in capital letters, I have only included A-Z, the space character, and some punctuation in the list of classes (32 in total), which is used to transform each target transcription string into a SparseTensor.
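For reference, a sketch of that string-to-SparseTensor conversion (the exact character set and ordering below are illustrative, not my actual class list):

```python
import numpy as np
import tensorflow as tf

chars = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ '.,-?")      # example 32-symbol set
char_to_id = {c: i for i, c in enumerate(chars)}

def to_sparse(transcripts):
    """Convert a list of uppercase transcriptions into a SparseTensorValue."""
    indices, values = [], []
    for b, text in enumerate(transcripts):
        for t, ch in enumerate(text):
            indices.append([b, t])
            values.append(char_to_id[ch])
    shape = [len(transcripts), max(len(t) for t in transcripts)]
    return tf.SparseTensorValue(np.array(indices, dtype=np.int64),
                                np.array(values, dtype=np.int32),
                                np.array(shape, dtype=np.int64))
```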
RNN Config:
Forward and backward cells: each is an LSTM cell with 300 units in each layer, using the peephole architecture, with the forget bias initially set to 0 to see how it performs.
Bidirectional dynamic RNN, with project_size set to a hidden_size of 500.
The sequence-length tensor is assigned the appropriate value for each example in the batch, i.e. its time length.
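Putting this configuration together, a rough sketch of one of the three layers (TF1-era API; `n_features`, the placeholders, and the cell variable names are illustrative):

```python
import tensorflow as tf

num_units = 300
n_features = 161                                   # e.g. nfft/2 + 1 from the preprocessing step

fw_cell = tf.nn.rnn_cell.LSTMCell(num_units, use_peepholes=True, forget_bias=0.0)
bw_cell = tf.nn.rnn_cell.LSTMCell(num_units, use_peepholes=True, forget_bias=0.0)

inputs = tf.placeholder(tf.float32, [None, None, n_features])   # [max_time, batch, n_features]
seq_lens_ph = tf.placeholder(tf.int32, [None])                  # actual length of each utterance

(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    fw_cell, bw_cell, inputs,
    sequence_length=seq_lens_ph,
    dtype=tf.float32,
    time_major=True)
outputs = tf.concat([out_fw, out_bw], axis=2)      # [max_time, batch, 2 * num_units]
```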
Since tf.nn.bidirectional_dynamic_rnn does not include an output layer (sigmoid or softmax), I apply a linear layer outside it, whose weights are of shape [hidden_size, n_chars].
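A sketch of that output layer; note that in this sketch I project from the concatenated forward/backward outputs, so the weight matrix is [2 * num_units, n_chars], and n_chars counts the character classes plus one extra index for the CTC blank, which tf.nn.ctc_loss reserves as the last class (the initialiser here is a placeholder, which ties into my initialiser question above):

```python
n_chars = 33                                        # 32 characters + 1 CTC blank
out_dim = outputs.get_shape().as_list()[-1]         # 2 * num_units = 600

W = tf.Variable(tf.truncated_normal([out_dim, n_chars], stddev=0.1))
b = tf.Variable(tf.zeros([n_chars]))

flat = tf.reshape(outputs, [-1, out_dim])           # [max_time * batch, out_dim]
logits = tf.reshape(tf.matmul(flat, W) + b,
                    [tf.shape(outputs)[0], tf.shape(outputs)[1], n_chars])
```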
I have used tf.nn.ctc_loss as the loss function; it returns huge values like 650 or 700 initially and slides down to at best around 500 after a few hundred epochs.
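The loss is wired up roughly like this (TF1-era tf.nn.ctc_loss; `targets` is fed with the SparseTensor built from the transcriptions):

```python
targets = tf.sparse_placeholder(tf.int32)

loss = tf.reduce_mean(
    tf.nn.ctc_loss(labels=targets, inputs=logits,
                   sequence_length=seq_lens_ph, time_major=True))
```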
Finally, the CTC beam search decoder is used to find the best path from the logits generated by the output softmax or sigmoid layer.
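A sketch of the decoding step; as far as I understand, tf.nn.ctc_beam_search_decoder takes the raw (pre-softmax) logits:

```python
decoded, log_probs = tf.nn.ctc_beam_search_decoder(logits, seq_lens_ph)
predicted_ids = tf.sparse_tensor_to_dense(decoded[0], default_value=-1)   # [batch, max_decoded_len]
```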
Now, I do not understand where I am going wrong, but I am just not getting the desired transcription (i.e., the weights are not converging to yield the targeted results). I request someone to please clarify why this is happening. I have tried to overfit the network on 100 audio clips, but with no success; the predicted results are nowhere near the desired transcriptions.
Thank you for your time, and support.
There are a lot of parameters to play with. I've found that the momentum optimizer with high momentum (greater than 0.99) tends to work well. Others have found that batching causes problems and that one should use smaller batch sizes.
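For example, something along these lines (the learning rate is just a placeholder):

```python
import tensorflow as tf

optimizer = tf.train.MomentumOptimizer(learning_rate=1e-4, momentum=0.99)
train_op = optimizer.minimize(loss)   # `loss` being the mean CTC loss from the question
```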
Either way, convergence for these models takes a long time.
If you want to try this, it's better to reproduce Eesen.
If you still want TensorFlow, you can find a complete example in the TensorFlow CTC example.