Saving a collection of variable length tensors to a TFRecords file in TensorFlow

I'm trying to save a list of tensors of different lengths to a TFRecords file so that they can be easily loaded later on. The tensors in question are 1-dimensional arrays of integers.

The reason for this is that the tensors are the result of processing a large text file. This file is very large and processing it is slow, so I don't want to have to repeat that step every time I want to run my algorithms. I originally thought of loading the text file into regular Python lists or numpy arrays and then pickling those, but the conversion from those lists to tensors itself takes a very long time, so I don't want to have to wait for that every time I run my script, either. It seems that tensors cannot be pickled directly, and even if there is some workaround for this I am under the impression that TFRecords is the "correct" way to save tensor data.

However, I am not sure how to properly save the tensors to a TFRecords file and then load them back in as tensors. I did go through the TensorFlow tutorial in which MNIST data is saved to TFRecords files and then loaded, but there are a few differences between that and my use case.

The following is a block of code intended to replicate the issues I'm having in a simpler case.

import tensorflow as tf

def _int64_list_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

filename = "/Users/me/tensorflow/test.tfrecords"
writer = tf.python_io.TFRecordWriter(filename)
example = tf.train.Example(features=tf.train.Features(feature={'datalist': _int64_list_feature([2,3])}))
writer.write(example.SerializeToString())
example = tf.train.Example(features=tf.train.Features(feature={'datalist': _int64_list_feature([8,5,7])}))
writer.write(example.SerializeToString())
writer.close()

The first few lines are standard. I write two 1-D tensors to a TFRecords file, one with length 2 and one with length 3.
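To make the "sequence of binary records" idea concrete, here is a toy stand-in for the file format in plain Python. This is not the real TFRecord framing (which also adds CRC checksums); it only illustrates how variable-length int64 lists can round-trip through length-delimited binary records:

```python
import struct

def write_records(lists):
    # each record: an 8-byte little-endian length prefix, then the payload
    out = b""
    for lst in lists:
        payload = struct.pack(f"<{len(lst)}q", *lst)  # int64s, like Int64List
        out += struct.pack("<Q", len(payload)) + payload
    return out

def read_records(buf):
    # walk the buffer, reading one length-delimited record at a time
    lists, off = [], 0
    while off < len(buf):
        (n,) = struct.unpack_from("<Q", buf, off); off += 8
        lists.append(list(struct.unpack_from(f"<{n // 8}q", buf, off))); off += n
    return lists

blob = write_records([[2, 3], [8, 5, 7]])
print(read_records(blob))  # [[2, 3], [8, 5, 7]]
```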

def read_my_file(filename_queue):
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(serialized_example, features={'datalist': tf.VarLenFeature(tf.int64), })
    datalist = features['datalist']
    return datalist

This is the helper function that it seems you are supposed to use. I am not 100% sure why it is necessary, but I couldn't get this to work without writing it, and all of the examples have something like it. In my case the data is unlabeled, so I don't have a labels variable.

filename_queue = tf.train.string_input_producer([filename], 2)
datalists = read_my_file(filename_queue)
datalists_batch = tf.train.batch([datalists], batch_size=2)

More "boilerplate"-style code from the examples. Batch size is 2 because I only have 2 examples in this code.

datalists_batch will now be a sparse tensor that contains both my vectors, [2, 3] and [8, 5, 7], the first on top of the second. Therefore, I want to split them back into individual tensors. At this point, I was already concerned that the runtime of this might be pretty long too, because in my real code there are over 100,000 individual tensors that will be split.
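The sparse layout described above can be sketched without TensorFlow. The triple below mirrors a SparseTensor's indices/values/dense_shape fields (the names are reused only for illustration):

```python
# Two variable-length rows batched together, as a sparse triple:
rows = [[2, 3], [8, 5, 7]]

# (row, col) coordinate of every stored value, in row-major order
indices = [(r, c) for r, row in enumerate(rows) for c in range(len(row))]
# the stored values themselves, in the same order
values = [v for row in rows for v in row]
# the shape of the dense batch: 2 rows, padded to the longest row (3)
dense_shape = (len(rows), max(len(r) for r in rows))

print(indices)      # [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2)]
print(values)       # [2, 3, 8, 5, 7]
print(dense_shape)  # (2, 3)
```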

split_list = tf.sparse_split(0, 2, datalists_batch)
sp0 = split_list[0]
sp1 = split_list[1]
sp0_dense = tf.sparse_tensor_to_dense(sp0)
sp1_dense = tf.sparse_tensor_to_dense(sp1)
sp0_dense = tf.squeeze(sp0_dense)
sp1_dense = tf.squeeze(sp1_dense)

split_list is now a list of the individual tensors, still in sparse format. They all have a length equal to that of the longest tensor (3 in this case), and they are 2-dimensional with the other dimension equal to 1, since the datalists_batch tensor was 2-D. I must now manipulate the tensors to get them into the proper format. In the real code I would of course use a for-loop, but there are only 2 examples in this case. First, I convert them to dense tensors. However, in the case of sp0 this fills in the last index with a 0, since sp0 only has 2 real values but the dense row has length 3. (This issue is discussed below.) Then, I "squeeze" them so that they are actually considered tensors of length 3 instead of 1x3.

Finally, I need to remove the trailing zero from sp0. This part gave me difficulty: I don't know programmatically how many trailing zeros a particular tensor has. It is equal to the length of the longest tensor minus the length of this tensor, but I don't know the "real" lengths of the tensors without looking at the sparse indices, and I cannot access those without evaluating them (since the indices are themselves tensors).
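The relationship between the sparse indices and the "real" lengths can be illustrated in plain Python (a stand-in, not TensorFlow code): each row's true length is just the number of index pairs that mention it, which is exactly what slicing off the padding requires:

```python
# Sparse form of the batch [[2, 3], [8, 5, 7]] padded to width 3:
indices = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2)]
values = [2, 3, 8, 5, 7]
dense_shape = (2, 3)

# densify, zero-filling the gaps (what tf.sparse_tensor_to_dense does)
dense = [[0] * dense_shape[1] for _ in range(dense_shape[0])]
for (r, c), v in zip(indices, values):
    dense[r][c] = v

# the true length of row i = how many index pairs have row i
lengths = [sum(1 for r, _ in indices if r == i) for i in range(dense_shape[0])]
# slice each dense row back down to its true length
trimmed = [row[:n] for row, n in zip(dense, lengths)]
print(trimmed)  # [[2, 3], [8, 5, 7]]
```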

indices_0 = sp0.indices
indices_1 = sp1.indices
indices0_size = tf.shape(indices_0)
indices1_size = tf.shape(indices_1)

These are necessary for the aforementioned slicing operations.

sess = tf.Session()
init_op = tf.initialize_all_variables()
sess.run(init_op)
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

Initializations.

sp0_fixed = tf.slice(sp0_dense, [0], [sess.run(indices0_size[0])])
sp1_fixed = tf.slice(sp1_dense, [0], [sess.run(indices1_size[0])])
sess.run(sp0_fixed)
sess.run(sp1_fixed)

This is how I would do it. The problem is, I get strange errors when running these last four lines. I surmise that the problem is that I am creating new ops after sess.run has already been called (in the sp0_fixed line), so the graph is being modified while it is being run. I think I should only call sess.run once. However, that makes it impossible for me to figure out the proper indices at which to slice each tensor (to remove the trailing zeros). Thus, I don't know what to do next.
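A pure-Python sketch of the underlying fix (no TensorFlow; the names are invented for illustration). Since tf.slice accepts tensors for its size argument, the slice bounds (e.g. tf.shape(sp0.indices)[0]) can stay symbolic, so every op can be built before a single run:

```python
# Toy model of graph construction vs. execution: build every "op",
# including the slice bounds, as a deferred thunk FIRST; only then run
# everything once. Creating new thunks after running has begun is the
# analogue of the error hit above.
padded = {0: [2, 3, 0], 1: [8, 5, 7]}
lengths = {0: 2, 1: 3}  # analogue of tf.shape(sp.indices)[0]

# build phase: nothing is computed yet
ops = [lambda i=i: padded[i][:lengths[i]] for i in (0, 1)]

# execution phase: the single "sess.run"
results = [op() for op in ops]
print(results)  # [[2, 3], [8, 5, 7]]
```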

I have surprisingly found nothing helpful at all on how to do something like this (saving and loading variable-length tensors to/from files) on Google, in the TensorFlow documentation, or on Stack Overflow. I am quite sure that I'm going about this the wrong way; even if there is a workaround that rewrites the last four lines so that the program behaves correctly, the code overall seems excessively complicated for such basic functionality.

I would greatly appreciate any suggestions and feedback.

asked Jun 21 '16 by user2258552




1 Answer

I don't have too much experience with TFRecords, but here's one way to store and retrieve variable-length arrays with TFRecords.

Writing a TFRecord

# creating a default session; we'll use it later
sess = tf.InteractiveSession()

def get_writable(arr):
    """
    Returns a serialized string for an input array of integers.
    arr : input array
    """
    arr = tf.train.Int64List(value=arr)
    arr = tf.train.Feature(int64_list=arr)
    arr = tf.train.Features(feature={'seqs': arr})
    arr = tf.train.Example(features=arr)
    return arr.SerializeToString()


filename = "./test2.tfrecords"
writer = tf.python_io.TFRecordWriter(filename)

# writing 3 different-sized arrays
writer.write(get_writable([1, 3, 5, 9]))
writer.write(get_writable([2, 7, 9]))
writer.write(get_writable([3, 4, 6, 5, 9]))
writer.close()

This writes the arrays into 'test2.tfrecords'.

Reading the file(s)

## Reading from the TFRecord file

## creating the reader and a filename queue
reader = tf.TFRecordReader()
filename_queue = tf.train.string_input_producer(['test2.tfrecords'])

## getting a serialized example from the reader
_, ser_ex = reader.read(filename_queue)

## features that you want to extract
read_features = {
    'seqs': tf.VarLenFeature(dtype=tf.int64)
}
batchSize = 2
# before parsing examples you must wrap them in tf.train.batch
# to get the desired batch size
batch = tf.train.batch([ser_ex], batch_size=batchSize, capacity=10)
read_data = tf.parse_example(batch, features=read_features)
# the queue runners must be started before reading any data
tf.train.start_queue_runners(sess)

Now we're ready to read the contents of the TFRecord file:

batches = 3
for _ in range(batches):

    # get the next sparse tensor of shape
    # (batchSize x elements in the largest array)
    # every time you evaluate read_data
    sparse_tensor = (list(read_data.values())[0]).eval()

    # as the batch size is larger than 1, you'll want the separate
    # lists that you fed in when writing the TFRecord file
    for i in tf.sparse_split(axis=0, num_split=batchSize, sp_input=sparse_tensor):
        i = i.eval()
        # getting the individual shape of each sparse tensor
        shape = [1, i.indices.shape[0]]
        # converting it into a dense tensor
        tens = tf.sparse_to_dense(sparse_indices=i.indices,
                                  sparse_values=i.values,
                                  output_shape=shape)
        # evaluating the final dense tensor
        print(tens.eval())
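One possible simplification, sketched in plain Python (assuming the standard SparseTensorValue layout, where values are stored contiguously in row-major order): each split row's .values field already holds the original list, so the sparse-to-dense step can arguably be skipped:

```python
from collections import namedtuple

# Pure-Python stand-in for SparseTensorValue: because values are stored
# in order, a single split row's .values IS the original variable-length
# list -- no densify/trim round trip needed.
SparseRow = namedtuple("SparseRow", ["indices", "values"])

rows = [SparseRow(indices=[(0, 0), (0, 1)], values=[2, 3]),
        SparseRow(indices=[(0, 0), (0, 1), (0, 2)], values=[8, 5, 7])]
print([list(r.values) for r in rows])  # [[2, 3], [8, 5, 7]]
```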

Check out this post for a great explanation of getting started with TFRecords.

answered Oct 21 '22 by dragonLOLz