I'm trying to use CTC for speech recognition with Keras and have tried the CTC example here. In that example, the input to the CTC Lambda layer is the output of the softmax layer (y_pred). The Lambda layer calls ctc_batch_cost, which internally calls TensorFlow's ctc_loss, but the TensorFlow ctc_loss documentation says that ctc_loss performs the softmax internally, so you don't need to softmax your input first. I think the correct usage is to pass inner to the Lambda layer so that softmax is applied only once, inside the ctc_loss function. I have tried the example and it works. Should I follow the example or the TensorFlow documentation?
A Connectionist Temporal Classification loss, or CTC loss, is designed for tasks where we need an alignment between sequences but that alignment is hard to obtain, e.g. aligning each character to its location in an audio file. It computes a loss between a continuous (unsegmented) time series and a target sequence.
CTC-based ASR decoding is mainly composed of two major steps: mapping and searching. In the mapping step, the acoustic information of each audio frame is mapped to an output token (such as a character or a blank). This is the alignment process, and it is a many-to-one mapping: many frame-level alignments correspond to the same target sequence.
CTC is an algorithm used to train deep neural networks for speech recognition, handwriting recognition and other sequence problems. It is used when we don't know how the input aligns with the output (how the characters in the transcript align to the audio); a small sketch of the many-to-one collapse is shown below.
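To make the many-to-one mapping concrete, here is a minimal sketch (not from the original question) of how a frame-level CTC alignment collapses to a transcript:

def ctc_collapse(alignment, blank="-"):
    """Collapse a frame-level alignment: merge repeated tokens, then drop blanks."""
    out = []
    prev = None
    for token in alignment:
        if token != prev and token != blank:
            out.append(token)
        prev = token
    return "".join(out)

print(ctc_collapse("hh-e-ll-lo--"))   # hello
print(ctc_collapse("--h-ee--l-l-o"))  # hello  (a different alignment, same transcript)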
The loss used in the code you posted is different from the one you linked. The loss used in the code is found here.
The Keras code performs some pre-processing before calling ctc_loss to put the input into the required format. On top of requiring the input to not be softmax-ed, TensorFlow's ctc_loss also expects the dims to be NUM_TIME, BATCHSIZE, FEATURES. Keras's ctc_batch_cost does both of these things in this line.
It takes the log(), which gets rid of the softmax scaling, and it also shuffles the dims so that the tensor is in the right shape. When I say it gets rid of the softmax scaling, it obviously does not restore the original tensor; rather, softmax(log(softmax(x))) = softmax(x), so the extra softmax does not change the loss. See below:
import numpy as np

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

x = np.array([1.0, 2.0, 3.0])
y = softmax(x)
z = np.log(y)              # z =/= x (obviously), BUT
yp = softmax(z)            # yp == y
print(np.allclose(y, yp))  # True
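For completeness, here is a rough sketch of that pre-processing (assumed shapes, a placeholder epsilon, and TensorFlow 2 APIs; it mimics, but is not, the actual Keras source): take the log of the softmax-ed predictions and permute them into the time-major layout that TensorFlow's ctc_loss expects:

import tensorflow as tf

# Assumed shape: y_pred is (BATCHSIZE, NUM_TIME, FEATURES) and already softmax-ed.
batch, num_time, features = 2, 5, 4
y_pred = tf.nn.softmax(tf.random.normal((batch, num_time, features)), axis=-1)

epsilon = 1e-7  # small constant added before the log to avoid log(0)
# log() neutralizes the softmax scaling (softmax(log(softmax(x))) == softmax(x)),
# and the transpose reorders the dims to (NUM_TIME, BATCHSIZE, FEATURES).
time_major_logits = tf.math.log(tf.transpose(y_pred, perm=[1, 0, 2]) + epsilon)
print(time_major_logits.shape)  # (5, 2, 4)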