Saving a collection of variable length tensors to a TFRecords file in TensorFlow

I'm trying to save a list of tensors of different lengths to a TFRecords file so that they can be easily loaded later on. The tensors in question are 1-dimensional arrays of integers.

The reason for this is that the tensors are the result of processing a large text file. This file is very large and processing it is slow, so I don't want to have to repeat that step every time I want to run my algorithms. I originally thought of loading the text file into regular Python lists or numpy arrays and then pickling those, but the conversion from those lists to tensors itself takes a very long time, so I don't want to have to wait for that every time I run my script, either. It seems that tensors cannot be pickled directly, and even if there is some workaround for this I am under the impression that TFRecords is the "correct" way to save tensor data.

However, I am not sure how to properly save the tensors to a TFRecords file and then load them back in as tensors. I did go through the TensorFlow tutorial in which MNIST data is saved to TFRecords files and then loaded, but there are a few differences between that and my use case.

The following is a block of code intended to replicate the issues I'm having in a simpler case.

import tensorflow as tf

def _int64_list_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

filename = "/Users/me/tensorflow/test.tfrecords"
writer = tf.python_io.TFRecordWriter(filename)
example = tf.train.Example(features=tf.train.Features(feature={'datalist': _int64_list_feature([2,3])}))
writer.write(example.SerializeToString())
example = tf.train.Example(features=tf.train.Features(feature={'datalist': _int64_list_feature([8,5,7])}))
writer.write(example.SerializeToString())
writer.close()

The first few lines are standard. I write two 1-D tensors to a TFRecords file, one with length 2 and one with length 3.
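To make the "sequence of binary records" idea concrete, here is a toy stand-in for the file format in plain Python. This is not the real TFRecord framing (which also adds CRC checksums); it only illustrates how variable-length int64 lists can round-trip through length-delimited binary records:

```python
import struct

def write_records(lists):
    # each record: an 8-byte little-endian length prefix, then the payload
    out = b""
    for lst in lists:
        payload = struct.pack(f"<{len(lst)}q", *lst)  # int64s, like Int64List
        out += struct.pack("<Q", len(payload)) + payload
    return out

def read_records(buf):
    # walk the buffer, reading one length-delimited record at a time
    lists, off = [], 0
    while off < len(buf):
        (n,) = struct.unpack_from("<Q", buf, off); off += 8
        lists.append(list(struct.unpack_from(f"<{n // 8}q", buf, off))); off += n
    return lists

blob = write_records([[2, 3], [8, 5, 7]])
print(read_records(blob))  # [[2, 3], [8, 5, 7]]
```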

def read_my_file(filename_queue):
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(serialized_example, features={'datalist': tf.VarLenFeature(tf.int64), })
    datalist = features['datalist']
    return datalist

This is the helper function that it seems you are supposed to use. I am not 100% sure why it is necessary, but I couldn't get this to work without writing it, and all of the examples have something like it. In my case the data is unlabeled, so I don't have a labels variable.

filename_queue = tf.train.string_input_producer([filename], 2)
datalists = read_my_file(filename_queue)
datalists_batch = tf.train.batch([datalists], batch_size=2)

More "boilerplate"-style code from the examples. Batch size is 2 because I only have 2 examples in this code.

datalists_batch will now be a sparse tensor that contains both my vectors, [2, 3] and [8, 5, 7], the first on top of the second. Therefore, I want to split them back into individual tensors. At this point, I was already concerned that the runtime of this might be pretty long too, because in my real code there are over 100,000 individual tensors that will be split.
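The sparse layout described above can be sketched without TensorFlow. The triple below mirrors a SparseTensor's indices/values/dense_shape fields (the names are reused only for illustration):

```python
# Two variable-length rows batched together, as a sparse triple:
rows = [[2, 3], [8, 5, 7]]

# (row, col) coordinate of every stored value, in row-major order
indices = [(r, c) for r, row in enumerate(rows) for c in range(len(row))]
# the stored values themselves, in the same order
values = [v for row in rows for v in row]
# the shape of the dense batch: 2 rows, padded to the longest row (3)
dense_shape = (len(rows), max(len(r) for r in rows))

print(indices)      # [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2)]
print(values)       # [2, 3, 8, 5, 7]
print(dense_shape)  # (2, 3)
```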

split_list = tf.sparse_split(0, 2, datalists_batch)
sp0 = split_list[0]
sp1 = split_list[1]
sp0_dense = tf.sparse_tensor_to_dense(sp0)
sp1_dense = tf.sparse_tensor_to_dense(sp1)
sp0_dense = tf.squeeze(sp0_dense)
sp1_dense = tf.squeeze(sp1_dense)

split_list is now a list of the individual tensors, still in sparse format. They all have a length equal to that of the longest tensor (3 in this case), and they are 2-dimensional with the other dimension equal to 1, since the datalists_batch tensor was 2-D. I must now manipulate the tensors to get them into the proper format. In the real code I would of course use a for-loop, but there are only 2 examples in this case. First, I convert them to dense tensors. However, in the case of sp0 this fills in the last index with a 0, since sp0 only has 2 real values but the dense row has length 3. (This issue is discussed below.) Then, I "squeeze" them so that they are actually considered tensors of length 3 instead of 1x3.

Finally, I need to remove the trailing zero from sp0. This part gave me difficulty: I don't know programmatically how many trailing zeros a particular tensor has. It is equal to the length of the longest tensor minus the length of this tensor, but I don't know the "real" lengths of the tensors without looking at the sparse indices, and I cannot access those without evaluating them (since the indices are themselves tensors).
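The relationship between the sparse indices and the "real" lengths can be illustrated in plain Python (a stand-in, not TensorFlow code): each row's true length is just the number of index pairs that mention it, which is exactly what slicing off the padding requires:

```python
# Sparse form of the batch [[2, 3], [8, 5, 7]] padded to width 3:
indices = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2)]
values = [2, 3, 8, 5, 7]
dense_shape = (2, 3)

# densify, zero-filling the gaps (what tf.sparse_tensor_to_dense does)
dense = [[0] * dense_shape[1] for _ in range(dense_shape[0])]
for (r, c), v in zip(indices, values):
    dense[r][c] = v

# the true length of row i = how many index pairs have row i
lengths = [sum(1 for r, _ in indices if r == i) for i in range(dense_shape[0])]
# slice each dense row back down to its true length
trimmed = [row[:n] for row, n in zip(dense, lengths)]
print(trimmed)  # [[2, 3], [8, 5, 7]]
```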

indices_0 = sp0.indices
indices_1 = sp1.indices
indices0_size = tf.shape(indices_0)
indices1_size = tf.shape(indices_1)

These are necessary for the aforementioned slicing operations.

sess = tf.Session()
init_op = tf.initialize_all_variables()
sess.run(init_op)
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

Initializations.

sp0_fixed = tf.slice(sp0_dense, [0], [sess.run(indices0_size[0])])
sp1_fixed = tf.slice(sp1_dense, [0], [sess.run(indices1_size[0])])
sess.run(sp0_fixed)
sess.run(sp1_fixed)

This is how I would do it. The problem is, I get strange errors when running these last four lines. I surmise that the problem is that I am creating new ops after sess.run has already been called (in the sp0_fixed line), so the graph is being modified while it is being run. I think I should only call sess.run once. However, that makes it impossible for me to figure out the proper indices at which to slice each tensor (to remove the trailing zeros). Thus, I don't know what to do next.
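A pure-Python sketch of the underlying fix (no TensorFlow; the names are invented for illustration). Since tf.slice accepts tensors for its size argument, the slice bounds (e.g. tf.shape(sp0.indices)[0]) can stay symbolic, so every op can be built before a single run:

```python
# Toy model of graph construction vs. execution: build every "op",
# including the slice bounds, as a deferred thunk FIRST; only then run
# everything once. Creating new thunks after running has begun is the
# analogue of the error hit above.
padded = {0: [2, 3, 0], 1: [8, 5, 7]}
lengths = {0: 2, 1: 3}  # analogue of tf.shape(sp.indices)[0]

# build phase: nothing is computed yet
ops = [lambda i=i: padded[i][:lengths[i]] for i in (0, 1)]

# execution phase: the single "sess.run"
results = [op() for op in ops]
print(results)  # [[2, 3], [8, 5, 7]]
```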

I have surprisingly found nothing helpful at all on how to do something like this (saving and loading variable-length tensors to/from files) on Google, in the TensorFlow documentation, or on Stack Overflow. I am quite sure that I'm going about this the wrong way; even if there is a workaround that rewrites the last four lines so that the program behaves correctly, the code overall seems excessively complicated for such basic functionality.

I would greatly appreciate any suggestions and feedback.

asked Jun 21 '16 by user2258552




1 Answer

I don't have too much experience with TFRecords, but here's one way to store and retrieve variable-length arrays with TFRecords.

Writing a TFRecord

# creating a default session; we'll use it later
sess = tf.InteractiveSession()

def get_writable(arr):
    """
    Returns a serialized string for an input array of integers.
    arr : input array
    """
    arr = tf.train.Int64List(value=arr)
    arr = tf.train.Feature(int64_list=arr)
    arr = tf.train.Features(feature={'seqs': arr})
    arr = tf.train.Example(features=arr)
    return arr.SerializeToString()


filename = "./test2.tfrecords"
writer = tf.python_io.TFRecordWriter(filename)

# writing 3 different-sized arrays
writer.write(get_writable([1, 3, 5, 9]))
writer.write(get_writable([2, 7, 9]))
writer.write(get_writable([3, 4, 6, 5, 9]))
writer.close()

This writes the arrays into 'test2.tfrecords'.

Reading the file(s)

## Reading from the TFRecord file

## creating the reader and a filename queue
reader = tf.TFRecordReader()
filename_queue = tf.train.string_input_producer(['test2.tfrecords'])

## getting a serialized example from the reader
_, ser_ex = reader.read(filename_queue)

## features that you want to extract
read_features = {
    'seqs': tf.VarLenFeature(dtype=tf.int64)
}
batchSize = 2
# before parsing examples you must wrap them in tf.train.batch
# to get the desired batch size
batch = tf.train.batch([ser_ex], batch_size=batchSize, capacity=10)
read_data = tf.parse_example(batch, features=read_features)
# the queue runners must be started before reading any data
tf.train.start_queue_runners(sess)

Now we're ready to read the contents of the TFRecord file:

batches = 3
for _ in range(batches):

    # get the next sparse tensor of shape
    # (batchSize x elements in the largest array)
    # every time you evaluate read_data
    sparse_tensor = (list(read_data.values())[0]).eval()

    # as the batch size is larger than 1, you'll want the separate
    # lists that you fed in when writing the TFRecord file
    for i in tf.sparse_split(axis=0, num_split=batchSize, sp_input=sparse_tensor):
        i = i.eval()
        # getting the individual shape of each sparse tensor
        shape = [1, i.indices.shape[0]]
        # converting it into a dense tensor
        tens = tf.sparse_to_dense(sparse_indices=i.indices,
                                  sparse_values=i.values,
                                  output_shape=shape)
        # evaluating the final dense tensor
        print(tens.eval())
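One possible simplification, sketched in plain Python (assuming the standard SparseTensorValue layout, where values are stored contiguously in row-major order): each split row's .values field already holds the original list, so the sparse-to-dense step can arguably be skipped:

```python
from collections import namedtuple

# Pure-Python stand-in for SparseTensorValue: because values are stored
# in order, a single split row's .values IS the original variable-length
# list -- no densify/trim round trip needed.
SparseRow = namedtuple("SparseRow", ["indices", "values"])

rows = [SparseRow(indices=[(0, 0), (0, 1)], values=[2, 3]),
        SparseRow(indices=[(0, 0), (0, 1), (0, 2)], values=[8, 5, 7])]
print([list(r.values) for r in rows])  # [[2, 3], [8, 5, 7]]
```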

Check out this post for a great explanation of getting started with TFRecords.

answered Oct 21 '22 by dragonLOLz