 

Working with variable-length text in Tensorflow

I am building a TensorFlow model to perform inference on text phrases. For the sake of simplicity, assume I need a classifier with a fixed number of output classes but variable-length text as input. In other words, my mini-batch would be a sequence of phrases, but not all phrases have the same length.

data = ['hello',
        'my name is Mark',
        'What is your name?']

My first preprocessing step was to build a vocabulary of all possible words and map each word to its integer word ID. The input becomes:

data = [[1],
        [2, 3, 4, 5],
        [6, 4, 7, 3]]
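
For completeness, this mapping can be produced with plain Python before any graph is built. A rough sketch (build_vocab is a hypothetical helper; stripping the '?' and starting the IDs at 1 are my assumptions, chosen to reproduce the IDs above):

def build_vocab(phrases):
    # hypothetical helper, not part of the original question
    vocab = {}
    encoded = []
    for phrase in phrases:
        ids = []
        for word in phrase.replace('?', '').split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1  # start at 1, keep 0 for padding
            ids.append(vocab[word])
        encoded.append(ids)
    return vocab, encoded

vocab, encoded = build_vocab(['hello', 'my name is Mark', 'What is your name?'])
# encoded -> [[1], [2, 3, 4, 5], [6, 4, 7, 3]]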

What's the best way to handle this kind of input? Can tf.placeholder() handle variable-sized input within the same batch of data? Or should I pad all strings so that they all have the same length, equal to the length of the longest string, using some padding symbol for the missing words? This seems very memory-inefficient if some strings are much longer than most of the others.
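
To illustrate the padding option, here is a minimal NumPy sketch (pad_batch is a hypothetical helper; reserving 0 as the padding ID is my assumption):

import numpy as np

def pad_batch(batch, pad_id=0):
    # pad every sequence to the length of the longest one in the batch
    max_len = max(len(seq) for seq in batch)
    padded = np.full((len(batch), max_len), pad_id, dtype=np.int32)
    for i, seq in enumerate(batch):
        padded[i, :len(seq)] = seq
    # keep the true lengths so downstream ops can ignore the padding
    lengths = np.array([len(seq) for seq in batch], dtype=np.int32)
    return padded, lengths

padded, lengths = pad_batch([[1], [2, 3, 4, 5], [6, 4, 7, 3]])
# padded  -> [[1, 0, 0, 0], [2, 3, 4, 5], [6, 4, 7, 3]]
# lengths -> [1, 4, 4]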

-- EDIT --

Here is a concrete example.

When I know the size of my datapoints (and all the datapoints have the same length, e.g. 3), I normally use something like:

input = tf.placeholder(tf.int32, shape=(None, 3))

with tf.Session() as sess:
  print(sess.run([...], feed_dict={input:[[1, 2, 3], [1, 2, 3]]}))

where the first dimension of the placeholder is the minibatch size.

What if the input sequences are words in sentences of different lengths?

feed_dict={input:[[1, 2, 3], [1]]}

asked Jul 27 '16 by Marco Ancona


2 Answers

The other answers are correct, but low on details. I was just looking at how to do this myself.

There is machinery in TensorFlow to do all of this (for some parts it may be overkill).

Starting from a string tensor (shape [3]):

import tensorflow as tf
lines = tf.constant([
    'Hello',
    'my name is also Mark',
    'Are there any other Marks here ?'])
vocabulary = ['Hello', 'my', 'name', 'is', 'also', 'Mark', 'Are', 'there', 'any', 'other', 'Marks', 'here', '?']

The first thing to do is split this into words (note the space before the question mark).

words = tf.string_split(lines, " ")  # SparseTensor of word strings

words will now be a sparse tensor of shape [3, 7], where the two dimensions of the indices are [line number, position]. It is represented as:

indices    values
 0 0       'Hello'
 1 0       'my'
 1 1       'name'
 1 2       'is'
 ...

Now you can do a word lookup:

table = tf.contrib.lookup.index_table_from_tensor(vocabulary)
word_indices = table.lookup(words)

This returns a sparse tensor with the words replaced by their vocabulary indices.

Now you can read out the sequence lengths by looking at the maximum position on each line:

line_number = word_indices.indices[:, 0]
line_position = word_indices.indices[:, 1]
lengths = tf.segment_max(data=line_position,
                         segment_ids=line_number) + 1

If you're processing variable-length sequences, you probably want to feed them into an LSTM. So let's use a word embedding for the input (it requires a dense input):

EMBEDDING_DIM = 100

dense_word_indices = tf.sparse_tensor_to_dense(word_indices)  # missing positions filled with 0
e_layer = tf.contrib.keras.layers.Embedding(len(vocabulary), EMBEDDING_DIM)
embedded = e_layer(dense_word_indices)

Now embedded will have a shape of [3, 7, 100], i.e. [lines, words, embedding_dim].

Then a simple lstm can be built:

LSTM_SIZE = 50
lstm = tf.nn.rnn_cell.BasicLSTMCell(LSTM_SIZE)

And run it across the sequence, handling the padding:

outputs, final_state = tf.nn.dynamic_rnn(
    cell=lstm,
    inputs=embedded,
    sequence_length=lengths,
    dtype=tf.float32)

Now outputs has a shape of [3, 7, 50], or [line, word, lstm_size]. If you want to grab the state at the last word of each line you can use the (hidden! undocumented!) select_last_activations function:

from tensorflow.contrib.learn.python.learn.estimators.rnn_common import select_last_activations
final_output = select_last_activations(outputs, tf.cast(lengths, tf.int32))

That does all the index shuffling needed to select the output from the last timestep. This gives a shape of [3, 50], or [line, lstm_size].
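
If you'd rather not depend on an undocumented function, the same selection can be done with public ops. A sketch using tf.gather_nd (my variant, equivalent given the shapes above):

batch_size = tf.shape(outputs)[0]
# pair each line index with the position of its last real word
last_word = tf.stack(
    [tf.range(batch_size), tf.cast(lengths, tf.int32) - 1], axis=1)
final_output_alt = tf.gather_nd(outputs, last_word)  # shape [3, 50]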

init_t = tf.tables_initializer()
init = tf.global_variables_initializer()
with tf.Session() as sess:
    init_t.run()
    init.run()
    print(final_output.eval().shape)

I haven't worked out the details yet but I think this could probably all be replaced by a single tf.contrib.learn.DynamicRnnEstimator.


answered by mdaoust


How about this? (I didn't implement this, but maybe the idea will work.) This method is based on a bag-of-words (BOW) representation.

  1. Get your data as tf.string.
  2. Split it using tf.string_split.
  3. Find the indexes of your words using tf.contrib.lookup.string_to_index_table_from_file or tf.contrib.lookup.string_to_index_table_from_tensor. The length of this tensor can vary.
  4. Find the embeddings of your indexes:
    word_embeddings = tf.get_variable("word_embeddings",
                                      [vocabulary_size, embedding_size])
    embedded_word_ids = tf.nn.embedding_lookup(word_embeddings, word_ids)
  5. Sum up the embeddings. You will get a tensor of fixed length (= embedding size). You could also choose a reduction other than sum (average, max, or something else). See the sketch after this list.
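
Putting those steps together, a minimal sketch (untested, matching the caveat above; tf.nn.embedding_lookup_sparse does the lookup and the sum/mean reduction in one call, and index_table_from_tensor is the same table constructor used in the first answer):

import tensorflow as tf

lines = tf.constant(['hello', 'my name is Mark'])
vocabulary = ['hello', 'my', 'name', 'is', 'Mark']

words = tf.string_split(lines)                                 # step 2
table = tf.contrib.lookup.index_table_from_tensor(vocabulary)  # step 3
word_ids = table.lookup(words)        # SparseTensor, rows of varying length

embedding_size = 100
word_embeddings = tf.get_variable('word_embeddings',
                                  [len(vocabulary), embedding_size])
# steps 4 and 5 in one call: look up embeddings and sum them per line
bow = tf.nn.embedding_lookup_sparse(word_embeddings, word_ids,
                                    sp_weights=None, combiner='sum')
# bow has fixed shape [batch_size, embedding_size]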

Maybe it's too late :) Good luck.


answered by plhn