I am using TensorFlow's tf.nn.ctc_beam_search_decoder() to decode the output of an RNN doing some many-to-many mapping (i.e., multiple softmax outputs for each network cell).
A simplified version of the network's output and the beam search decoder is:
import numpy as np
import tensorflow as tf

batch_size = 4
sequence_max_len = 5
num_classes = 3

y_pred = tf.placeholder(tf.float32, shape=(batch_size, sequence_max_len, num_classes))
y_pred_transposed = tf.transpose(y_pred,
                                 perm=[1, 0, 2])  # TF expects dimensions [max_time, batch_size, num_classes]
logits = tf.log(y_pred_transposed)
sequence_lengths = tf.to_int32(tf.fill([batch_size], sequence_max_len))

decoded, log_probabilities = tf.nn.ctc_beam_search_decoder(logits,
                                                           sequence_length=sequence_lengths,
                                                           beam_width=3,
                                                           merge_repeated=False,
                                                           top_paths=1)
decoded = decoded[0]
decoded_paths = tf.sparse_tensor_to_dense(decoded)  # Shape: [batch_size, max_sequence_len]

with tf.Session() as session:
    tf.global_variables_initializer().run()

    softmax_outputs = np.array([[[0.1, 0.1, 0.8], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1]],
                                [[0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]],
                                [[0.1, 0.7, 0.2], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]],
                                [[0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]]])

    decoded_paths = session.run(decoded_paths, feed_dict={y_pred: softmax_outputs})
    print(decoded_paths)
The output in this case is:
[[0]
[1]
[1]
[1]]
My understanding is that the output tensor should be of dimensions [batch_size, max_sequence_len], with each row containing the indices of the relevant classes in the found path.
In this case I would expect the output to be similar to:
[[2, 0, 0, 0, 0],
[2, 2, 2, 2, 2],
[1, 2, 2, 2, 2],
[2, 2, 2, 2, 2]]
What am I not understanding about how ctc_beam_search_decoder works?
As indicated in the tf.nn.ctc_beam_search_decoder documentation, the shape of the output is not [batch_size, max_sequence_len]. Instead, it is [batch_size, max_decoded_length[j]] (with j=0 in your case).
Based on the beginning of section 2 of this paper (which is cited in the GitHub repository), max_decoded_length[0] is bounded from above by max_sequence_len, but they are not necessarily equal. The relevant citation is:
Let S be a set of training examples drawn from a fixed distribution D_{X×Z}. The input space X = (R^m)* is the set of all sequences of m dimensional real valued vectors. The target space Z = L* is the set of all sequences over the (finite) alphabet L of labels. In general, we refer to elements of L* as label sequences or labellings. Each example in S consists of a pair of sequences (x, z). The target sequence z = (z1, z2, ..., zU) is at most as long as the input sequence x = (x1, x2, ..., xT), i.e. U <= T. Since the input and target sequences are not generally the same length, there is no a priori way of aligning them.
In fact, max_decoded_length[0] depends on the specific matrix softmax_outputs. In particular, two such matrices with exactly the same dimensions can result in different max_decoded_length[0].
For example, if you replace the lines
softmax_outputs = np.array([[[0.1, 0.1, 0.8], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1]],
                            [[0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]],
                            [[0.1, 0.7, 0.2], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]],
                            [[0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]]])
with the lines
np.random.seed(7)
r = np.random.randint(0, 100, size=(4, 5, 3))
softmax_outputs = r / np.sum(r, 2).reshape(4, 5, 1)
you'll get the output
[[1 0 1]
[1 0 1]
[1 0 0]
[1 0 0]]
(In the above examples, softmax_outputs consists of valid softmax probabilities, each innermost row sums to 1, and it has exactly the same dimensions as the matrix you provided.)
On the other hand, changing the seed to np.random.seed(50)
gives the output
[[1 0]
[1 0]
[1 0]
[0 1]]
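You can read max_decoded_length[0] off directly, without converting to a dense tensor, by fetching the decoded SparseTensor itself and inspecting its dense_shape. A minimal sketch, reusing y_pred, decoded and softmax_outputs from the code above (session.run on a SparseTensor returns a SparseTensorValue):
with tf.Session() as session:
    tf.global_variables_initializer().run()
    # Fetch the top decoded path as a SparseTensorValue
    sparse_result = session.run(decoded, feed_dict={y_pred: softmax_outputs})
    # dense_shape is [batch_size, max_decoded_length[0]]:
    # [4, 1] for the original matrix, [4, 3] for seed 7 and [4, 2] for seed 50
    print(sparse_result.dense_shape)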
P.S.
Regarding the last part of your question:
In this case I would expect the output to be similar to:
[[2, 0, 0, 0, 0], [2, 2, 2, 2, 2], [1, 2, 2, 2, 2], [2, 2, 2, 2, 2]]
Note that, based on the documentation, num_classes actually represents num_labels + 1. Specifically:
The inputs Tensor's innermost dimension size, num_classes, represents num_labels + 1 classes, where num_labels is the number of true labels, and the largest value (num_classes - 1) is reserved for the blank label.
For example, for a vocabulary containing 3 labels [a, b, c], num_classes = 4 and the labels indexing is {a: 0, b: 1, c: 2, blank: 3}.
So the true labels in your case are 0 and 1, and 2 is reserved for the blank label. The blank label represents the situation of observing no label (section 3.1 here):
A CTC network has a softmax output layer (Bridle, 1990) with one more unit than there are labels in L. The activations of the first |L| units are interpreted as the probabilities of observing the corresponding labels at particular times. The activation of the extra unit is the probability of observing a ‘blank’, or no label. Together, these outputs define the probabilities of all possible ways of aligning all possible label sequences with the input sequence.
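To make the role of the blank concrete, here is a minimal NumPy sketch of the CTC collapse rule used by greedy decoding (not TensorFlow's beam search): take the per-frame argmax, merge consecutive repeats, and drop the blank index num_classes - 1. Keep in mind that beam search sums the probability of a labelling over all of its alignments, so its result can differ from this frame-wise collapse; that is presumably why the rows consisting entirely of [0.1, 0.2, 0.7] decode to [1] above rather than to the empty sequence. The sketch reuses softmax_outputs and num_classes from the question's code:
import numpy as np

def greedy_ctc_collapse(probabilities, blank_index):
    # Frame-wise argmax, merge consecutive repeats, then remove blanks.
    decoded = []
    for sequence in probabilities:            # sequence: [max_time, num_classes]
        best_path = np.argmax(sequence, axis=1)
        labels, previous = [], None
        for label in best_path:
            if label != previous and label != blank_index:
                labels.append(int(label))
            previous = label
        decoded.append(labels)
    return decoded

# The per-frame argmax paths of the original matrix are
# [2, 0, 0, 0, 0], [2, 2, 2, 2, 2], [1, 2, 2, 2, 2], [2, 2, 2, 2, 2],
# which collapse to [[0], [], [1], []] with blank_index = 2.
print(greedy_ctc_collapse(softmax_outputs, blank_index=num_classes - 1))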