
tf.SequenceExample with multidimensional arrays

In Tensorflow, I want to save a multidimensional array to a TFRecord. For example:

[[1, 2, 3], [1, 2], [3, 2, 1]]

As the task I am trying to solve is sequential, I am trying to use Tensorflow's tf.train.SequenceExample() and when writing the data I am successful in writing the data to a TFRecord file. However, when I try to load the data from the TFRecord file using tf.parse_single_sequence_example, I am greeted with a large number of cryptic errors:

W tensorflow/core/framework/op_kernel.cc:936] Invalid argument: Name: , Key: input_characters, Index: 1.  Number of int64 values != expected.  values size: 6 but output shape: []
E tensorflow/core/client/tensor_c_api.cc:485] Name: , Key: input_characters, Index: 1.  Number of int64 values != expected.  values size: 6 but output shape: []

The function I am using to try to load my data is below:

def read_and_decode_single_example(filename):

    filename_queue = tf.train.string_input_producer([filename],
                                                num_epochs=None)

    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)

    context_features = {
         "length": tf.FixedLenFeature([], dtype=tf.int64)
    }

    sequence_features = {
         "input_characters": tf.FixedLenSequenceFeature([],           dtype=tf.int64),
         "output_characters": tf.FixedLenSequenceFeature([], dtype=tf.int64)
    }

    context_parsed, sequence_parsed = tf.parse_single_sequence_example(
        serialized=serialized_example,
        context_features=context_features,
        sequence_features=sequence_features
    )

    context = tf.contrib.learn.run_n(context_parsed, n=1, feed_dict=None)
    print context

The function that I am using to save the data is here:

# http://www.wildml.com/2016/08/rnns-in-tensorflow-a-practical-guide-and-undocumented-features/
from itertools import izip_longest  # Python 2; needed for the loop below

def make_example(input_sequence, output_sequence):
    """
    Makes a single example from Python lists that follows the
    format of tf.train.SequenceExample.
    """

    example_sequence = tf.train.SequenceExample()

    # 3D length
    sequence_length = sum([len(word) for word in input_sequence])
    example_sequence.context.feature["length"].int64_list.value.append(sequence_length)

    input_characters = example_sequence.feature_lists.feature_list["input_characters"]
    output_characters = example_sequence.feature_lists.feature_list["output_characters"]

    for input_character, output_character in izip_longest(input_sequence,
                                                          output_sequence):

        # Extend seems to work, therefore it replaces append.
        if input_sequence is not None:
            input_characters.feature.add().int64_list.value.extend(input_character)

        if output_characters is not None:
            output_characters.feature.add().int64_list.value.extend(output_character)

    return example_sequence

Any help would be welcomed.

Torkoal asked Sep 16 '16

2 Answers

I had the same problem. I think it is entirely solvable, but you have to decide on the output format and then figure out how you're going to use it.

First, what is your error?

The error message is telling you that what you are trying to read doesn't fit into the feature size that you specified. So where did you specify it? Right here:

sequence_features = {
    "input_characters": tf.FixedLenSequenceFeature([], dtype=tf.int64),
    "output_characters": tf.FixedLenSequenceFeature([], dtype=tf.int64)
}

This says "my input_characters is a sequence of single values", but that is not true: what you have is a sequence of sequences of single values, hence the error.
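To make the mismatch concrete, here is a tiny plain-Python mimic of the per-step check the parser is doing (the helper name is made up for illustration):

```python
def values_match_shape(rows, expected_per_step):
    """Mimic the parser's per-step check that triggers
    'Number of int64 values != expected'."""
    return all(len(row) == expected_per_step for row in rows)

rows = [[1, 2, 3], [1, 2], [3, 2, 1]]
values_match_shape(rows, 1)  # False: shape [] expects one value per step
values_match_shape(rows, 3)  # still False: the middle row has only 2 values
```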

Second, what can you do?

If you instead use:

a = [[1,2,3], [2,3,1], [3,2,1]] 
sequence_features = {
    "input_characters": tf.FixedLenSequenceFeature([3], dtype=tf.int64),
    "output_characters": tf.FixedLenSequenceFeature([3], dtype=tf.int64)
}

You will not get an error with your code, because you have specified that each element of the top-level sequence is 3 elements long.

Alternatively, if you do not have fixed length sequences, then you're going to have to use a different type of feature.

sequence_features = {
    "input_characters": tf.VarLenFeature(tf.int64),
    "output_characters": tf.VarLenFeature(tf.int64)
}

The VarLenFeature tells it that the length is unknown before reading. Unfortunately this means that your input_characters can no longer be read as a dense vector in one step. Instead, it will be a SparseTensor by default. You can turn this into a dense tensor with tf.sparse_tensor_to_dense, e.g.:

input_densified = tf.sparse_tensor_to_dense(sequence_parsed['input_characters'])
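Conceptually, the sparse-to-dense step fills a default-valued grid and scatters the stored values back into it. A plain-Python sketch of the idea for a rank-2 SparseTensor (not the actual TF kernel; the function name is made up):

```python
def densify(indices, values, dense_shape, default=0):
    """Scatter (row, col) -> value pairs into a default-filled 2-D grid,
    roughly what tf.sparse_tensor_to_dense does for a rank-2 SparseTensor."""
    rows, cols = dense_shape
    dense = [[default] * cols for _ in range(rows)]
    for (r, c), v in zip(indices, values):
        dense[r][c] = v
    return dense

# the ragged [[1, 2, 3], [2, 3]] data stored sparsely:
indices = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)]
values = [1, 2, 3, 2, 3]
densify(indices, values, (2, 3))  # -> [[1, 2, 3], [2, 3, 0]]
```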

As mentioned in the article that you've been looking at, if your data does not always have the same length, you will need a "not_really_a_word" word in your vocabulary to use as the default index. For example, say index 0 maps to "not_really_a_word"; then your

a = [[1,2,3],  [2,3],  [3,2,1]]

Python list will end up being a

array((1,2,3),  (2,3,0),  (3,2,1))

tensor.
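That padding can also be done in plain Python before writing, so the first (fixed-length) method applies. A minimal sketch, assuming pad index 0 is the "not_really_a_word" index:

```python
def pad_rows(rows, pad_value=0):
    """Right-pad every inner list to the length of the longest row."""
    width = max(len(row) for row in rows)
    return [row + [pad_value] * (width - len(row)) for row in rows]

a = [[1, 2, 3], [2, 3], [3, 2, 1]]
pad_rows(a)  # -> [[1, 2, 3], [2, 3, 0], [3, 2, 1]]
```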

Be warned: I'm not certain that back-propagation "just works" for SparseTensors the way it does for dense tensors. The wildml article talks about padding with 0s per sequence and masking the loss for the "not_actually_a_word" word (see "SIDE NOTE: BE CAREFUL WITH 0'S IN YOUR VOCABULARY/CLASSES" in their article). This suggests that the first method will be easier to implement.

Note that this is different from the case described here, where each example is a sequence of sequences. To my understanding, the reason this kind of method is not well supported is that it abuses the case it is meant to support: loading fixed-size embeddings directly.


I will assume that the very next thing you want to do is turn those numbers into word embeddings. You can turn a list of indices into a list of embeddings with tf.nn.embedding_lookup.
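As a rough illustration of what that lookup does, a plain-Python sketch with a made-up 2-dimensional embedding table (tf.nn.embedding_lookup does the same indexing as a differentiable op over a real embedding matrix):

```python
embedding_table = [
    [0.0, 0.0],  # index 0: the "not_really_a_word" / padding entry
    [0.1, 0.2],  # index 1
    [0.3, 0.4],  # index 2
    [0.5, 0.6],  # index 3
]

def embedding_lookup(table, indices):
    """Map each index to its embedding row."""
    return [table[i] for i in indices]

embedding_lookup(embedding_table, [1, 2, 3])  # one 2-D vector per index
```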

Multihunter answered Oct 11 '22


With the provided code I wasn't able to reproduce your error, but making some educated guesses gave the following working code.

import tensorflow as tf

tmp_filename = 'tf.tmp'

sequences = [[1, 2, 3], [1, 2], [3, 2, 1]]
label_sequences = [[0, 1, 0], [1, 0], [1, 1, 1]]

def make_example(input_sequence, output_sequence):
    """
    Makes a single example from Python lists that follows the
    format of tf.train.SequenceExample.
    """

    example_sequence = tf.train.SequenceExample()

    # Number of steps in this sequence
    sequence_length = len(input_sequence)

    example_sequence.context.feature["length"].int64_list.value.append(sequence_length)

    input_characters = example_sequence.feature_lists.feature_list["input_characters"]
    output_characters = example_sequence.feature_lists.feature_list["output_characters"]

    for input_character, output_character in zip(input_sequence,
                                                 output_sequence):

        if input_character is not None:
            input_characters.feature.add().int64_list.value.append(input_character)

        if output_character is not None:
            output_characters.feature.add().int64_list.value.append(output_character)

    return example_sequence

# Write all examples into a TFRecords file
def save_tf(filename):
    writer = tf.python_io.TFRecordWriter(filename)
    for sequence, label_sequence in zip(sequences, label_sequences):
        ex = make_example(sequence, label_sequence)
        writer.write(ex.SerializeToString())
    writer.close()

def read_and_decode_single_example(filename):

    filename_queue = tf.train.string_input_producer([filename],
                                                num_epochs=None)

    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)

    context_features = {
         "length": tf.FixedLenFeature([], dtype=tf.int64)
    }

    sequence_features = {
         "input_characters": tf.FixedLenSequenceFeature([], dtype=tf.int64),
         "output_characters": tf.FixedLenSequenceFeature([], dtype=tf.int64)
    }


    return serialized_example, context_features, sequence_features

save_tf(tmp_filename)
ex, context_features, sequence_features = read_and_decode_single_example(tmp_filename)
context_parsed, sequence_parsed = tf.parse_single_sequence_example(
    serialized=ex,
    context_features=context_features,
    sequence_features=sequence_features
)

sequence = tf.contrib.learn.run_n(sequence_parsed, n=1, feed_dict=None)
#check if the saved data matches the input data
print(sequences[0] in sequence[0]['input_characters'])

The required changes were:

  1. sequence_length = sum([len(word) for word in input_sequence]) was changed to sequence_length = len(input_sequence), since the original does not work for your example data.

  2. extend was changed to append.
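The length change matters because the example sequences here are flat lists of ints, so the original sum-of-lens expression cannot even evaluate on them:

```python
input_sequence = [1, 2, 3]  # flat, like the first entry of `sequences` above

try:
    # original expression: assumes each element is itself a sequence
    length = sum(len(word) for word in input_sequence)
except TypeError:  # ints have no len()
    length = None

corrected_length = len(input_sequence)  # 3 steps, one per feature written
```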
Maximilian Peters answered Oct 11 '22