 

Tensorflow 1.10 TFRecordDataset - recovering TFRecords

Notes:

  1. This question extends a previous question of mine, in which I ask about the best way to store some dummy data as Example and SequenceExample, seeking to know which is better for data similar to the dummy data provided. That question provides both explicit formulations of the Example and SequenceExample construction and, in the answers, a programmatic way to do so.

  2. Because this is still a lot of code, I am providing a Colab (an interactive Jupyter notebook hosted by Google) where you can try the code out yourself. All the necessary code is there and it is generously commented.

I am trying to learn how to convert my data into TFRecords, as the claimed benefits are worthwhile for my data. However, the documentation leaves a lot to be desired, and the tutorials / blogs I have seen that try to go deeper really only touch the surface or rehash the sparse docs that exist.

For the demo data considered in my previous question - as well as here - I have written a decent class that takes:

  • a sequence with n channels (in this example, integer-based and of fixed length)
  • soft-labeled class probabilities (in this example there are n classes, float-based)
  • some meta data (in this example a string and two floats)

and can encode the data in 1 of 6 forms:

  1. Example, with sequence channels / classes separate in a numeric type (int64 in this case) and meta data tacked on
  2. Example, with sequence channels / classes separate as a byte string (via numpy.ndarray.tostring()) and meta data tacked on
  3. Example, with sequence / classes dumped as a byte string and meta data tacked on
  4. SequenceExample, with sequence channels / classes separate in a numeric type and meta data as context
  5. SequenceExample, with sequence channels separate as a byte string and meta data as context
  6. SequenceExample, with sequence and classes dumped as a byte string and meta data as context

This works fine; a sketch of two of these forms follows.
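
To make the forms concrete, here is a minimal sketch (TF 1.x API) of form 1 (Example with numeric channels) and form 4 (SequenceExample with meta data as context). The dummy values are my own illustration; the feature names follow the parsing code in the answer below.

import numpy as np
import tensorflow as tf

# Dummy data: a fixed-length sequence with 3 channels, 3 soft class
# probabilities, and some meta data (a string and two floats).
sequence = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.int64)
pclasses = np.array([0.1, 0.7, 0.2], dtype=np.float32)
name, val_1, val_2 = b'dummy', 0.5, 1.5

def _int64(values): return tf.train.Feature(int64_list=tf.train.Int64List(value=values))
def _float(values): return tf.train.Feature(float_list=tf.train.FloatList(value=values))
def _bytes(values): return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

# Form 1: Example, channels/classes as numeric features, meta data tacked on.
example = tf.train.Example(features=tf.train.Features(feature={
    'channel_0': _int64(sequence[:, 0]),
    'channel_1': _int64(sequence[:, 1]),
    'channel_2': _int64(sequence[:, 2]),
    'pclasses' : _float(pclasses),
    'Name'     : _bytes([name]),
    'Val_1'    : _float([val_1]),
    'Val_2'    : _float([val_2]),
}))

# Form 4: SequenceExample, meta data as context, one (unnamed) Feature
# per time step inside each FeatureList.
sequence_example = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        'Name' : _bytes([name]),
        'Val_1': _float([val_1]),
        'Val_2': _float([val_2]),
    }),
    feature_lists=tf.train.FeatureLists(feature_list={
        'sequence': tf.train.FeatureList(feature=[_int64(step) for step in sequence]),
        'pclasses': tf.train.FeatureList(feature=[_float(pclasses)]),
    }))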

In the Colab I show how to write the dummy data both all in the same file and in separate files.
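
For reference, both layouts are written with tf.python_io.TFRecordWriter (the TF 1.x writer). A minimal sketch, reusing the two protos from the sketch above; the single-file name is my own illustration, while the per-record names match the ones read back in the answer below:

import tensorflow as tf

records = [example, sequence_example]  # protos built as sketched above

# All records in one file:
with tf.python_io.TFRecordWriter('dummy_all.tfrecords') as writer:
    for record in records:
        writer.write(record.SerializeToString())

# One file per record:
for i, record in enumerate(records):
    with tf.python_io.TFRecordWriter(f'dummy_sequences_{i}.tfrecords') as writer:
        writer.write(record.SerializeToString())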

My question is: how can I recover this data?

I have made 4 attempts at doing so in the linked file.

Also, why is TFRecordReader under a different sub-package from TFRecordWriter?

asked Aug 28 '18 by SumNeuron



1 Answer

Solved by updating the features to include shape information and remembering that a SequenceExample stores its sequences as unnamed FeatureLists.

import os
import tensorflow as tf

# Fixed-length context (meta data) features.
context_features = {
    'Name' : tf.FixedLenFeature([], dtype=tf.string),
    'Val_1': tf.FixedLenFeature([], dtype=tf.float32),
    'Val_2': tf.FixedLenFeature([], dtype=tf.float32)
}

# Sequence features now carry their shape: 3 values per time step.
sequence_features = {
    'sequence': tf.FixedLenSequenceFeature((3,), dtype=tf.int64),
    'pclasses': tf.FixedLenSequenceFeature((3,), dtype=tf.float32),
}

def parse(record):
  # Returns a (context, sequences) tuple of dicts of tensors.
  parsed = tf.parse_single_sequence_example(
        record,
        context_features=context_features,
        sequence_features=sequence_features
  )
  return parsed


filenames = [os.path.join(os.getcwd(), f"dummy_sequences_{i}.tfrecords") for i in range(3)]
dataset = tf.data.TFRecordDataset(filenames).map(parse)

iterator = tf.data.Iterator.from_structure(dataset.output_types,
                                           dataset.output_shapes)
next_element = iterator.get_next()

training_init_op = iterator.make_initializer(dataset)

with tf.Session() as sess:
  for _ in range(2):
    # Initialize the iterator over the training dataset, then pull
    # one record per file.
    sess.run(training_init_op)
    for _ in range(3):
      ne = sess.run(next_element)
      print(ne)
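
For the byte-string variants (forms 2, 3, 5 and 6 in the question), the parsed feature is still a string and must be decoded back into numbers. A minimal sketch, assuming a hypothetical key 'sequence_raw' that holds an int64 array dumped via numpy.ndarray.tostring():

# Hypothetical layout: the whole (time, 3) sequence dumped as one byte string.
raw_features = {
    'sequence_raw': tf.FixedLenFeature([], dtype=tf.string),
}

def parse_raw(record):
  parsed = tf.parse_single_example(record, features=raw_features)
  # Undo numpy.ndarray.tostring(): reinterpret the bytes as int64,
  # then restore the original (time, channels) shape.
  sequence = tf.decode_raw(parsed['sequence_raw'], tf.int64)
  return tf.reshape(sequence, (-1, 3))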
answered Oct 18 '22 by SumNeuron