I'm trying to save a list of tensors of different lengths to a TFRecords file so that they can be easily loaded later on. The tensors in question are 1-dimensional arrays of integers.
The reason for this is that the tensors are the result of processing a large text file. This file is very large and processing it is slow, so I don't want to have to repeat that step every time I want to run my algorithms. I originally thought of loading the text file into regular Python lists or numpy arrays and then pickling those, but the conversion from those lists to tensors itself takes a very long time, so I don't want to have to wait for that every time I run my script, either. It seems that tensors cannot be pickled directly, and even if there is some workaround for this I am under the impression that TFRecords is the "correct" way to save tensor data.
However, I am not sure how to properly save the tensors to a TFRecords file and then load them back in as tensors. I did go through the TensorFlow tutorial in which MNIST data is saved to TFRecords files and then loaded, but there are a few differences between that and my use case.
The following is a block of code intended to replicate the issues I'm having in a simpler case.
import tensorflow as tf
def _int64_list_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))
filename = "/Users/me/tensorflow/test.tfrecords"
writer = tf.python_io.TFRecordWriter(filename)
example = tf.train.Example(features=tf.train.Features(feature={'datalist': _int64_list_feature([2,3])}))
writer.write(example.SerializeToString())
example = tf.train.Example(features=tf.train.Features(feature={'datalist': _int64_list_feature([8,5,7])}))
writer.write(example.SerializeToString())
writer.close()
The first few lines are standard. I write two 1-D tensors to a TFRecords file, one with length 2 and one with length 3.
def read_my_file(filename_queue):
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(serialized_example, features={'datalist': tf.VarLenFeature(tf.int64)})
    datalist = features['datalist']
    return datalist
This is the helper function that it seems you are supposed to use. I am not 100% sure why it is necessary, but I couldn't get it to work without writing this, and all of the examples have something like it. In my case, the data is unlabeled, so I don't have a labels variable.
filename_queue = tf.train.string_input_producer([filename], 2)
datalists = read_my_file(filename_queue)
datalists_batch = tf.train.batch([datalists], batch_size=2)
More "boilerplate"-style code from the examples. Batch size is 2 because I only have 2 examples in this code.
datalists_batch will now be a sparse tensor that contains both of my vectors, [2, 3] and [8, 5, 7], the first on top of the second. Therefore, I want to split them back into individual tensors. At this point, I was already concerned that the runtime of this might be pretty long too, because in my real code there are over 100,000 individual tensors that will be split.
split_list = tf.sparse_split(0, 2, datalists_batch)
sp0 = split_list[0]
sp1 = split_list[1]
sp0_dense = tf.sparse_tensor_to_dense(sp0)
sp1_dense = tf.sparse_tensor_to_dense(sp1)
sp0_dense = tf.squeeze(sp0_dense)
sp1_dense = tf.squeeze(sp1_dense)
split_list is now a list of the individual tensors, still in sparse format (and all having a length equal to the length of the longest tensor, which in this case is 3; they are also 2-dimensional with the other dimension 1, since the datalists_batch tensor was 2-D). I must now manipulate the tensors to get them into the proper format. In the real code, I would of course use a for-loop, but there are only 2 examples here. First, I convert them to dense tensors. However, in the case of sp0 this fills in the last index with a 0, since the dense tensor has length 3 while the original vector only had 2 values. (This issue is discussed below.) Then, I "squeeze" them so that they are actually considered tensors of length 3 instead of 1x3.
Finally, I need to remove the trailing zero from sp0. This part gave me difficulty. I don't know programmatically how many trailing zeros a particular tensor has. It is equal to the length of the longest tensor minus the length of this tensor, but I don't know the "real" lengths of the tensors without looking at the sparse indices, and I cannot access those without evaluating the "temp" tensor (since the indices are themselves tensors).
indices_0 = sp0.indices
indices_1 = sp1.indices
indices0_size = tf.shape(indices_0)
indices1_size = tf.shape(indices_1)
These are necessary for the aforementioned slicing operations.
sess = tf.Session()
init_op = tf.initialize_all_variables()
sess.run(init_op)
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
Initializations.
sp0_fixed = tf.slice(sp0_dense, [0], [sess.run(indices0_size[0])])
sp1_fixed = tf.slice(sp1_dense, [0], [sess.run(indices1_size[0])])
sess.run(sp0_fixed)
sess.run(sp1_fixed)
This is how I would do it. The problem is, I get strange errors when running these last three commands. I surmise that the problem is that I am creating new ops after sess.run has already been called (in the sp0_fixed line), so the graph is being run simultaneously. I think I should only run sess.run once. However, this makes it impossible for me to figure out the proper indices at which to slice each tensor (to remove the trailing zeros). Thus, I don't know what to do next.
I have, surprisingly, found nothing at all helpful on how to do something like this (save and load variable-length tensors to/from files) on Google, in the TensorFlow documentation, or on StackOverflow. I am quite sure that I'm going about this the wrong way; even if there is a workaround to rewrite the last four lines so that the program behaves correctly, the code overall seems excessively complicated for such basic functionality.
I would greatly appreciate any suggestions and feedback.
One way would be to convert the tensor with a.numpy(), save the array with np.save('file.npy'), and then convert back to a tensor after loading.
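A minimal sketch of that idea, assuming the ragged data is kept as a plain Python list of numpy arrays (the filename rows.npy is just for illustration):

import numpy as np
import tensorflow as tf

# Save: store the variable-length rows as an object array, since rows of
# different lengths cannot form a single rectangular array.
rows = [np.array([2, 3]), np.array([8, 5, 7])]
np.save('rows.npy', np.array(rows, dtype=object), allow_pickle=True)

# Load: read the object array back and convert each row to a tensor.
loaded = np.load('rows.npy', allow_pickle=True)
tensors = [tf.convert_to_tensor(r, dtype=tf.int64) for r in loaded]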
The TFRecord format is a simple format for storing a sequence of binary records. Protocol buffers are a cross-platform, cross-language library for efficient serialization of structured data. Protocol messages are defined by .proto files; these are often the easiest way to understand a message type.
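For example, you can parse a written record back into the Example message to see that structure (a small sketch, reusing the test.tfrecords file and the 'datalist' key from the question):

import tensorflow as tf

# Iterate over the raw serialized records and decode each one as an Example proto.
for serialized in tf.python_io.tf_record_iterator("/Users/me/tensorflow/test.tfrecords"):
    example = tf.train.Example()
    example.ParseFromString(serialized)
    # The int64_list holds the variable-length values that were written.
    print(example.features.feature['datalist'].int64_list.value)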
I don't have too much experience with TFRecords, but here's one way to store and retrieve variable-length arrays with TFRecords.
Writing a TFRecord file
# creating a default session, we'll use it later
sess = tf.InteractiveSession()

def get_writable(arr):
    """
    This function returns a serialized string
    for an input array of integers.
    arr : input array
    """
    arr = tf.train.Int64List(value=arr)
    arr = tf.train.Feature(int64_list=arr)
    arr = tf.train.Features(feature={'seqs': arr})
    arr = tf.train.Example(features=arr)
    return arr.SerializeToString()
filename = "./test2.tfrecords"
writer = tf.python_io.TFRecordWriter(filename)

# writing 3 different-sized arrays
writer.write(get_writable([1, 3, 5, 9]))
writer.write(get_writable([2, 7, 9]))
writer.write(get_writable([3, 4, 6, 5, 9]))
writer.close()
This writes the three arrays into 'test2.tfrecords'.
Reading the file(s)
## Reading from the tfrecord file

# creating a filename queue and a reader
reader = tf.TFRecordReader()
filename_queue = tf.train.string_input_producer(['test2.tfrecords'])
_, ser_ex = reader.read(filename_queue)

# features that you want to extract
read_features = {
    'seqs': tf.VarLenFeature(dtype=tf.int64)
}

batchSize = 2
# before parsing examples you must wrap them in tf.train.batch to get the desired batch size
batch = tf.train.batch([ser_ex], batch_size=batchSize, capacity=10)
read_data = tf.parse_example(batch, features=read_features)

# starting the reading queues is required before reading the data
tf.train.start_queue_runners(sess)
Now we're ready to read the contents of the TFRecord file.
batches = 3
for _ in range(batches):
    # get the next sparse tensor of shape (batchSize x elements in the largest array)
    # every time the parsed feature is evaluated
    sparse_tensor = (list(read_data.values())[0]).eval()
    # as the batch size is larger than 1, split it back into the separate lists
    # that were fed in at the time of writing the tfrecord file
    for i in tf.sparse_split(axis=0, num_split=batchSize, sp_input=sparse_tensor):
        i = i.eval()
        # getting the individual shape of each sparse tensor
        shape = [1, i.indices.shape[0]]
        # converting it into a dense tensor
        tens = tf.sparse_to_dense(sparse_indices=i.indices, sparse_values=i.values, output_shape=shape)
        # evaluating the final dense tensor
        print(tens.eval())
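If you don't actually need batching, a simpler route (an untested sketch in the same TF 1.x style) is to parse one serialized example at a time; with tf.VarLenFeature the sparse tensor for a single example already has the true length as its dense shape, so densifying it gives back exactly the array that was written, with no padding to strip:

# Sketch: per-example reading, no batching, no trailing zeros to remove.
reader = tf.TFRecordReader()
filename_queue = tf.train.string_input_producer(['test2.tfrecords'])
_, serialized = reader.read(filename_queue)
parsed = tf.parse_single_example(serialized, features={'seqs': tf.VarLenFeature(tf.int64)})
dense_seq = tf.sparse_tensor_to_dense(parsed['seqs'])  # 1-D, length = true length of this example

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    for _ in range(3):
        print(sess.run(dense_seq))  # e.g. [1 3 5 9], then [2 7 9], then [3 4 6 5 9]
    coord.request_stop()
    coord.join(threads)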
Check out this post; it's a great explanation to get started with TFRecords.