Notes:
This question extends upon a previous question of mine, in which I asked about the best way to store some dummy data as Example and SequenceExample, seeking to know which is better for data similar to the dummy data provided. In that question I give explicit formulations of both the Example and SequenceExample construction as well as, in the answers, a programmatic way to do so.
Because this is still a lot of code, I am providing a Colab (an interactive Jupyter notebook hosted by Google) where you can try the code out yourself. All the necessary code is there and it is generously commented.
I am trying to learn how to convert my data into TF Records as the claimed benefits are worthwhile for my data. However, the documentation leaves a lot to be desired and the tutorials / blogs (that I have seen) which try to go deeper, really only touch the surface or rehash the sparse docs that exist.
For the demo data considered in my previous question, as well as here, I have written a decent class that takes the dummy data and can encode it in 1 of 6 forms (the first and fourth are sketched just below):

1. Example, with sequence channels / classes separated in numeric type (int64 in this case) with meta data tacked on
2. Example, with sequence channels / classes serialized as byte strings (via numpy.ndarray.tostring()) with meta data tacked on
3. Example, with sequence / classes dumped as a byte string with meta data tacked on
4. SequenceExample, with sequence channels / classes separate in a numeric type and meta data as context
(the remaining forms are the corresponding SequenceExample variants with the channels / classes stored as byte strings)
This works fine.
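To make the first and fourth forms concrete, here is a minimal sketch, assuming dummy data of shape n_steps x 3 and the meta data fields (Name, Val_1, Val_2) used in the answer below; all values are made up:

import numpy as np
import tensorflow as tf

sequence = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)    # n_steps x 3 channels (made up)
pclasses = np.array([[0.1, 0.2, 0.7], [0.3, 0.3, 0.4]])        # soft class probabilities (made up)
name, val_1, val_2 = 'A', 1.0, 2.0                             # meta data (made up)

# Form 1: an Example with the sequence / classes flattened into flat lists plus meta data
example = tf.train.Example(features=tf.train.Features(feature={
    'sequence': tf.train.Feature(int64_list=tf.train.Int64List(value=sequence.ravel().tolist())),
    'pclasses': tf.train.Feature(float_list=tf.train.FloatList(value=pclasses.ravel().tolist())),
    'Name'    : tf.train.Feature(bytes_list=tf.train.BytesList(value=[name.encode()])),
    'Val_1'   : tf.train.Feature(float_list=tf.train.FloatList(value=[val_1])),
    'Val_2'   : tf.train.Feature(float_list=tf.train.FloatList(value=[val_2])),
}))

# Form 4: a SequenceExample with one Feature per time step and the meta data as context
sequence_example = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        'Name' : tf.train.Feature(bytes_list=tf.train.BytesList(value=[name.encode()])),
        'Val_1': tf.train.Feature(float_list=tf.train.FloatList(value=[val_1])),
        'Val_2': tf.train.Feature(float_list=tf.train.FloatList(value=[val_2])),
    }),
    feature_lists=tf.train.FeatureLists(feature_list={
        'sequence': tf.train.FeatureList(feature=[
            tf.train.Feature(int64_list=tf.train.Int64List(value=step.tolist()))
            for step in sequence]),
        'pclasses': tf.train.FeatureList(feature=[
            tf.train.Feature(float_list=tf.train.FloatList(value=step.tolist()))
            for step in pclasses]),
    }))

Either proto can then be serialized with .SerializeToString() and handed to a TFRecord writer.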
In the Colab I show how to write dummy data all in the same file as well as in separate files.
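For reference, the single-file vs. separate-files split is just a matter of where the serialized protos go; a rough sketch with placeholder records and made-up file names (dummy_all.tfrecords / dummy_part_{i}.tfrecords):

import tensorflow as tf

def tiny_record(i):
    # Placeholder: any serialized Example / SequenceExample works here
    return tf.train.Example(features=tf.train.Features(feature={
        'value': tf.train.Feature(int64_list=tf.train.Int64List(value=[i])),
    })).SerializeToString()

records = [tiny_record(i) for i in range(3)]

# All records in one file
# (tf.io.TFRecordWriter; tf.python_io.TFRecordWriter in older TF 1.x)
with tf.io.TFRecordWriter('dummy_all.tfrecords') as writer:
    for r in records:
        writer.write(r)

# One file per record
for i, r in enumerate(records):
    with tf.io.TFRecordWriter(f'dummy_part_{i}.tfrecords') as writer:
        writer.write(r)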
My question is: how can I recover this data?
I have made 4 attempts at doing so in the linked file.
Why is TFRecordReader under a different sub-package from TFRecordWriter?
In this post, I'm going to discuss TensorFlow Records. TensorFlow recommends storing and reading data in the TFRecord format. It internally uses Protocol Buffers to serialize/deserialize the data and store it as bytes, which takes less space to hold a large amount of data and to transfer it as well.
A TFRecord file is a sequence of such records serialized to binary; the binary format takes less storage space than other data formats. That's what I'm going to do now: I will convert all the records of a dataset to TFRecords, which can be serialized into binary and written to a file. TensorFlow says that:
Any byte-string that can be decoded in TensorFlow could be stored in a TFRecord file. Examples include: lines of text, JSON (using tf.io.decode_json_example), encoded image data, or serialized tf.Tensors (using tf.io.serialize_tensor/tf.io.parse_tensor).
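As a small illustration of the "serialized tf.Tensors" case (assuming TF 2.x eager execution, with made-up values):

import tensorflow as tf

t = tf.constant([[1, 2, 3], [4, 5, 6]], dtype=tf.int64)

# Serialize the tensor to a byte-string that can live in a BytesList feature
serialized = tf.io.serialize_tensor(t)                      # scalar tf.string tensor
feature = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=[serialized.numpy()]))

# After reading the byte-string back out of a record, recover the tensor
restored = tf.io.parse_tensor(serialized, out_type=tf.int64)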
Let's start writing to a TFRecord file. The process is as simple as follows: read MNIST data and pre-process it, then write the MNIST data to a TFRecord file. NOTE: You may object, asking why we have to write MNIST data to a TFRecord file when MNIST is a small and ready-to-use dataset. The answer is simple: using MNIST here is just for educational purposes.
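A minimal sketch of that write step, assuming tf.keras.datasets for loading MNIST and a made-up file name (mnist_train.tfrecords); encoding each image as raw bytes plus an int64 label is just one reasonable option:

import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

with tf.io.TFRecordWriter('mnist_train.tfrecords') as writer:
    for image, label in zip(x_train, y_train):
        example = tf.train.Example(features=tf.train.Features(feature={
            'image': _bytes_feature(image.tobytes()),   # 28x28 uint8 image as raw bytes
            'label': _int64_feature(int(label)),
        }))
        writer.write(example.SerializeToString())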
Solved by updating the features to include shape information and remembering that a SequenceExample's sequence features are unnamed FeatureLists.
import os
import tensorflow as tf  # TF 1.x style API, as used in the question

# Context (per-example meta data) parse spec
context_features = {
    'Name' : tf.FixedLenFeature([], dtype=tf.string),
    'Val_1': tf.FixedLenFeature([], dtype=tf.float32),
    'Val_2': tf.FixedLenFeature([], dtype=tf.float32)
}

# Sequence parse spec: each step carries 3 values, hence the (3,) shape
sequence_features = {
    'sequence': tf.FixedLenSequenceFeature((3,), dtype=tf.int64),
    'pclasses' : tf.FixedLenSequenceFeature((3,), dtype=tf.float32),
}

def parse(record):
    # Returns a (context, sequence) pair of dicts of tensors
    parsed = tf.parse_single_sequence_example(
        record,
        context_features=context_features,
        sequence_features=sequence_features
    )
    return parsed

filenames = [os.path.join(os.getcwd(), f"dummy_sequences_{i}.tfrecords") for i in range(3)]
dataset = tf.data.TFRecordDataset(filenames).map(parse)

iterator = tf.data.Iterator.from_structure(dataset.output_types,
                                           dataset.output_shapes)
next_element = iterator.get_next()
training_init_op = iterator.make_initializer(dataset)

with tf.Session() as sess:
    for _ in range(2):
        # Initialize an iterator over the training dataset.
        sess.run(training_init_op)
        for _ in range(3):
            ne = sess.run(next_element)
            print(ne)