What would be the best way to store a sparse vector in a TFRecord? My sparse vector only contains ones and zeros, so I decided to just save the indexes where the ones are located, like this:
example = tf.train.Example(
    features=tf.train.Features(
        feature={
            'label': self._int64_feature(label),
            'features': self._int64_feature_list(values)
        }
    )
)
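The helper methods aren't shown in the question; here is a minimal sketch of what they presumably look like, plus the write step (the filename data.tfrecord is illustrative):
import tensorflow as tf

def _int64_feature(value):
    # Wrap a single int in an Int64List feature.
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _int64_feature_list(values):
    # Wrap a (possibly empty) list of ints in an Int64List feature.
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

# Serialize and append the example to a TFRecord file (TF 1.x API).
with tf.python_io.TFRecordWriter('data.tfrecord') as writer:
    writer.write(example.SerializeToString())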
Here, values is a list containing the indexes of the ones. This values array sometimes contains hundreds of elements and sometimes none at all. After that I simply save the serialized example to a TFRecord file. Later, I read the TFRecord back like this:
features = tf.parse_single_example(
    serialized_example,
    features={
        # 'label' has a known, fixed length; 'features' is
        # variable-length, so tf.VarLenFeature is used for it.
        'label': tf.FixedLenFeature([], dtype=tf.int64),
        'features': tf.VarLenFeature(dtype=tf.int64)
    }
)
label = features['label']
values = features['features']
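(Here serialized_example comes from reading the file back; a minimal sketch, assuming the TF 1.x queue-based reader that parse_single_example is typically paired with, with an illustrative filename:)
filename_queue = tf.train.string_input_producer(['data.tfrecord'])
reader = tf.TFRecordReader()
# Each read returns one serialized tf.train.Example from the file.
_, serialized_example = reader.read(filename_queue)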
This doesn't work: values comes back as a SparseTensor, and I don't get the data I saved. What is the best way to store a sparse tensor in TFRecords, and how do I read it back?
If you're just serializing the locations of the 1s, you should be able to get your sparse tensor back out with a little bit of trickery:
The parsed sparse tensor features['features'] will look something like this:
features['features'].indices: [[batch_id, position], ...]
where position is a useless enumeration. But you really want features['features'] to look like [[batch_id, one_position], ...], where one_position is the actual value you specified in your sparse tensor.
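As a concrete (made-up) illustration, assuming the parsed examples have been batched, e.g. via tf.parse_example or a batched tf.data pipeline, so that a batch dimension exists:
# Suppose a batch of two examples that stored one-positions [2, 5] and [0].
# The parsed SparseTensor then looks like:
#   features['features'].indices -> [[0, 0], [0, 1], [1, 0]]  # [batch_id, enumeration]
#   features['features'].values  -> [2, 5, 0]                 # the stored one-positions
# The transformation below moves the values into the second index column,
# yielding indices [[0, 2], [0, 5], [1, 0]].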
So:
indices = features['features'].indices
indices = tf.transpose(indices)
# Now looks like [[batch_id, batch_id, ...], [position, position, ...]]
indices = tf.stack([indices[0], features['features'].values])
# Now looks like [[batch_id, batch_id, ...], [one_position, one_position, ...]]
indices = tf.transpose(indices)
# Now looks like [[batch_id, one_position], [batch_id, one_position], ...]
features['features'] = tf.SparseTensor(
    indices=indices,
    values=tf.ones(shape=tf.shape(indices)[:1]),
    dense_shape=1 + tf.reduce_max(indices, axis=[0])
)
Voila! features['features'] now represents a matrix that is your batch of sparse vectors concatenated.
NOTE: if you want to treat this as a dense tensor, you'll have to convert it with tf.sparse_to_dense, AND the dense tensor will have shape [None, None] (which makes it kind of hard to work with). If you know the max possible vector length, you might want to hardcode it: dense_shape=[batch_size, max_vector_length]
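A minimal sketch of that densify step, assuming batch_size and max_vector_length are known ahead of time (the numbers are illustrative); tf.sparse_tensor_to_dense is the variant that takes a SparseTensor directly:
batch_size = 32            # hypothetical
max_vector_length = 1000   # hypothetical

sparse = tf.SparseTensor(
    indices=indices,
    values=tf.ones(shape=tf.shape(indices)[:1]),
    dense_shape=[batch_size, max_vector_length]
)
# Rows are the original 0/1 vectors; absent positions default to 0.
dense = tf.sparse_tensor_to_dense(sparse)  # shape [batch_size, max_vector_length]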