I am building a Tensorflow model to perform inference on text phrases. For sake of simplicity, assume I need a classifier with fixed number of output classes but a variable-length text in input. In other words, my mini batch would be a sequence of phrases but not all phrases have the same length.
data = ['hello',
'my name is Mark',
'What is your name?']
My first preprocessing step was to build a dictionary of all possible words in the dictionary and map each word to its integer word-Id. The input becomes:
data = [[1],
[2, 3, 4, 5],
[6, 4, 7, 3]
What's the best way to handle this kind of input? Can tf.placeholder() handle variable-size input within the same batch of data? Or should I pad all strings such that they all have the same length, equal to the length of the longest string, using some placeholder for the missing words? This seems to be very memory inefficient if some string are much longer that most of the others.
-- EDIT --
Here is a concrete example.
When I know the size of my datapoints (and all the datapoint have the same length, eg. 3) I normally use something like:
input = tf.placeholder(tf.int32, shape=(None, 3)
with tf.Session() as sess:
print(sess.run([...], feed_dict={input:[[1, 2, 3], [1, 2, 3]]}))
where the first dimension of the placeholder is the minibatch size.
What if the input sequences are words in sentences of different length?
feed_dict={input:[[1, 2, 3], [1]]}
The other two answers are correct, but low on details. I was just looking at how to do this myself.
There is machinery in TensorFlow to to all of this (for some parts it may be overkill).
Starting from a string tensor (shape [3]):
import tensorflow as tf
lines = tf.constant([
'Hello',
'my name is also Mark',
'Are there any other Marks here ?'])
vocabulary = ['Hello', 'my', 'name', 'is', 'also', 'Mark', 'Are', 'there', 'any', 'other', 'Marks', 'here', '?']
The first thing to do is split this into words (note the space before the question mark.)
words = tf.string_split(lines," ")
Words will now be a sparse tensor (shape [3,7]). Where the two dimensions of the indices are [line number, position]. This is represented as:
indices values
0 0 'hello'
1 0 'my'
1 1 'name'
1 2 'is'
...
Now you can do a word lookup:
table = tf.contrib.lookup.index_table_from_tensor(vocabulary)
word_indices = table.lookup(words)
This returns a sparse tensor with the words replaced by their vocabulary indices.
Now you can read out the sequence lengths by looking at the maximum position on each line :
line_number = word_indices.indices[:,0]
line_position = word_indices.indices[:,1]
lengths = tf.segment_max(data = line_position,
segment_ids = line_number)+1
So if you're processing variable length sequences it's probably to put in an lstm ... so let's use a word-embedding for the input (it requires a dense input):
EMBEDDING_DIM = 100
dense_word_indices = tf.sparse_tensor_to_dense(word_indices)
e_layer = tf.contrib.keras.layers.Embedding(len(vocabulary), EMBEDDING_DIM)
embedded = e_layer(dense_word_indices)
Now embedded will have a shape of [3,7,100], [lines, words, embedding_dim].
Then a simple lstm can be built:
LSTM_SIZE = 50
lstm = tf.nn.rnn_cell.BasicLSTMCell(LSTM_SIZE)
And run the across the sequence, handling the padding.
outputs, final_state = tf.nn.dynamic_rnn(
cell=lstm,
inputs=embedded,
sequence_length=lengths,
dtype=tf.float32)
Now outputs has a shape of [3,7,50], or [line,word,lstm_size]. If you want to grab the state at the last word of each line you can use the (hidden! undocumented!) select_last_activations
function:
from tensorflow.contrib.learn.python.learn.estimators.rnn_common import select_last_activations
final_output = select_last_activations(outputs,tf.cast(lengths,tf.int32))
That does all the index shuffling to select the output from the last timestep. This gives a size of [3,50] or [line, lstm_size]
init_t = tf.tables_initializer()
init = tf.global_variables_initializer()
with tf.Session() as sess:
init_t.run()
init.run()
print(final_output.eval().shape())
I haven't worked out the details yet but I think this could probably all be replaced by a single tf.contrib.learn.DynamicRnnEstimator.
How about this? (I didn’t implement this. but maybe this idea will work.) This method is based on BOW representation.
tf.string
tf.string_split
tf.contrib.lookup.string_to_index_table_from_file
or tf.contrib.lookup.string_to_index_table_from_tensor
. Length of this tensor can vary.word_embeddings = tf.get_variable(“word_embeddings”, [vocabulary_size, embedding_size]) embedded_word_ids = tf.nn.embedding_lookup(word_embeddings, word_ids)`
sum
.(avg
, mean
or something else)Maybe it’s too late :) Good luck.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With