
Saving and reading variable size list from TFRecord

Tags:

tensorflow

What would be the best way to store a sparse vector in a TFRecord? My sparse vector contains only ones and zeros, so I decided to save just the indices where the ones are located, like this:

example = tf.train.Example(
    features=tf.train.Features(
        feature={
            'label': self._int64_feature(label),
            'features': self._int64_feature_list(values),
        }
    )
)
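The two helper methods are not shown in the question. A minimal sketch of what they might look like, written as plain functions, is below; the names `int64_feature`, `int64_feature_list`, and `make_example` are assumptions for illustration, not the asker's actual code.

```python
import tensorflow as tf


def int64_feature(value):
    """Wrap a single integer in a tf.train.Feature (hypothetical helper)."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def int64_feature_list(values):
    """Wrap a list of integers, possibly empty, in a tf.train.Feature."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))


def make_example(label, values):
    """Build an Example from a label and the positions of the ones."""
    return tf.train.Example(
        features=tf.train.Features(
            feature={
                'label': int64_feature(label),
                'features': int64_feature_list(values),
            }
        )
    )


# Toy input: label 1, ones at positions 0, 3 and 7.
serialized = make_example(1, [0, 3, 7]).SerializeToString()
```

The resulting byte string is what would be handed to a TFRecord writer.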

Here, values is a list containing the indices of the ones. It sometimes holds hundreds of elements and sometimes none at all. After that I simply save the serialized example to a TFRecord file. Later, I read the TFRecord like this:

features = tf.parse_single_example(
    serialized_example,
    features={
        # 'label' is a scalar, so its length is fixed. 'features'
        # varies in length, hence tf.VarLenFeature.
        'label': tf.FixedLenFeature([], dtype=tf.int64),
        'features': tf.VarLenFeature(dtype=tf.int64)
    }
)

label = features['label']
values = features['features']

This doesn't work because values is parsed as a sparse tensor, and I don't get back the data I saved. What is the best way to store a sparse tensor in TFRecords, and how do I read it back?

Drag0 asked May 17 '16 08:05




1 Answer

If you're serializing only the locations of the 1s, you should be able to recover the correct sparse tensor with a little bit of trickery.

For a batch of examples, the parsed sparse tensor features['features'] will look something like this:

features['features'].indices: [[batch_id, position], ...]

where position is a useless enumeration (the element's offset within the stored list).

What you really want is for features['features'].indices to look like [[batch_id, one_position], ...], where one_position is the actual value you stored, i.e. the index of a one in your sparse vector.

So:

indices = features['features'].indices
indices = tf.transpose(indices)
# Now looks like [[batch_id, batch_id, ...], [position, position, ...]]
indices = tf.stack([indices[0], features['features'].values])
# Now looks like [[batch_id, batch_id, ...], [one_position, one_position, ...]]
indices = tf.transpose(indices)
# Now looks like [[batch_id, one_position], [batch_id, one_position], ...]
features['features'] = tf.SparseTensor(
    indices=indices,
    # One 1.0 per stored index.
    values=tf.ones(shape=tf.shape(indices)[:1]),
    # Smallest shape that contains every index.
    dense_shape=1 + tf.reduce_max(indices, axis=[0])
)

Voilà! features['features'] now represents a matrix whose rows are the sparse vectors in your batch.
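To make the index swap concrete, here is a plain-Python illustration with toy numbers (no TensorFlow involved). The input layout is an assumption that mimics what a batched VarLenFeature parse yields; the variable names are made up for this sketch.

```python
# Toy batch: example 0 has ones at positions 3 and 7, example 1 at position 2.
parsed_indices = [[0, 0], [0, 1], [1, 0]]  # [batch_id, position in stored list]
parsed_values = [3, 7, 2]                  # the stored positions of the ones

# Keep batch_id and replace the enumeration with the stored value.
fixed_indices = [
    [batch_id, one_position]
    for (batch_id, _), one_position in zip(parsed_indices, parsed_values)
]

# Smallest shape containing every index: 1 + the maximum of each column.
dense_shape = [1 + max(column) for column in zip(*fixed_indices)]

# Materialize the dense 0/1 matrix to check the result.
dense = [[0] * dense_shape[1] for _ in range(dense_shape[0])]
for row, col in fixed_indices:
    dense[row][col] = 1
```

Note that the inferred width here is only as large as the biggest index seen in the batch, which is why it changes from batch to batch.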

NOTE: if you want to treat this as a dense tensor, you'll have to convert it (e.g. with tf.sparse_tensor_to_dense, or tf.sparse.to_dense in newer releases), and the dense tensor will have shape [None, None], which makes it kind of hard to work with. If you know the maximum possible vector length, you might want to hardcode it: dense_shape=[batch_size, max_vector_length]
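As a rough end-to-end sketch of the hardcoded-shape variant, the snippet below assumes TensorFlow 2.x with eager execution and its `tf.io` / `tf.sparse` names, which differ from the older API used above. `MAX_VECTOR_LENGTH` and `serialize` are hypothetical names introduced only for this example.

```python
import tensorflow as tf  # assumes TensorFlow 2.x (eager execution)

MAX_VECTOR_LENGTH = 8  # hypothetical: known upper bound on the vector length


def serialize(one_positions):
    """Build a serialized Example holding the positions of the ones."""
    feature = tf.train.Feature(
        int64_list=tf.train.Int64List(value=one_positions))
    example = tf.train.Example(
        features=tf.train.Features(feature={'features': feature}))
    return example.SerializeToString()


# Toy batch; the second example has no ones at all.
batch = [serialize([3, 7]), serialize([]), serialize([2])]

parsed = tf.io.parse_example(
    batch, features={'features': tf.io.VarLenFeature(tf.int64)})
sparse = parsed['features']

# Keep batch_id (column 0) and use the stored values as column indices.
indices = tf.stack([sparse.indices[:, 0], sparse.values], axis=1)
multi_hot = tf.SparseTensor(
    indices=indices,
    values=tf.ones_like(sparse.values, dtype=tf.float32),
    dense_shape=[len(batch), MAX_VECTOR_LENGTH])

# reorder() guards against stored positions that were not sorted.
dense = tf.sparse.to_dense(tf.sparse.reorder(multi_hot))
```

With the shape fixed up front, `dense` has a static shape of [3, 8], and the empty example simply becomes an all-zero row.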

Eli Bixby answered Oct 16 '22 21:10