TensorFlow provides three different formats for storing data in a tf.train.Feature. These are:
tf.train.BytesList
tf.train.FloatList
tf.train.Int64List
I often struggle to choose between tf.train.Int64List / tf.train.FloatList and tf.train.BytesList.
I see some examples online where ints/floats are converted to bytes and then stored in a tf.train.BytesList. Is this preferable to using one of the other formats? If so, why does TensorFlow even provide tf.train.Int64List and tf.train.FloatList as optional formats when you could just convert everything to bytes and use tf.train.BytesList?
Thank you.
The TFRecord format is a simple format for storing a sequence of binary records. Converting your data into TFRecord has several advantages, such as more efficient storage: the TFRecord data can take up less space than the original data, and it can also be partitioned into multiple files.
The tf.train.Example message (or protobuf) is a flexible message type that represents a {"string": value} mapping. It is designed for use with TensorFlow and is used throughout the higher-level APIs such as TFX.
Ideally, you should shard the data to ~10*N files (where N is the number of hosts reading the data in parallel and X is the total data size), as long as ~X/(10*N) is 10+ MB (and ideally 100+ MB). If it is less than that, you might need to create fewer shards to trade off parallelism benefits and I/O prefetching benefits.
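The rule of thumb above can be turned into a small calculation. `suggest_num_shards` is a hypothetical helper (not part of TensorFlow) that starts from the 10*N target and backs off until each shard meets a minimum size:

```python
# Hypothetical helper illustrating the sharding rule of thumb:
# aim for ~10 * num_hosts shards, but reduce the count so that
# each shard stays at or above min_shard_mb.
def suggest_num_shards(total_mb, num_hosts, min_shard_mb=10):
    shards = 10 * num_hosts
    while shards > 1 and total_mb / shards < min_shard_mb:
        shards -= 1
    return shards

print(suggest_num_shards(1000, 4))  # 40 shards of 25 MB each
print(suggest_num_shards(100, 4))   # capped at 10 shards of 10 MB
```

For a small dataset the helper collapses toward a single shard, reflecting the trade-off the quote describes: too many tiny shards lose the I/O prefetching benefit.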
TFRecord is a binary format for efficiently encoding long sequences of tf.Example protos. TFRecord files are easily loaded by TensorFlow through the tf.data package.
Because a bytes list will require more memory. It is designed to store string data, or, for example, NumPy arrays converted to a single bytestring. Consider this example:
from sys import getsizeof

import numpy as np
import tensorflow as tf

def int64_feature(value):
    if type(value) != list:
        value = [value]
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def float_feature(value):
    if type(value) != list:
        value = [value]
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

def bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

writer = tf.python_io.TFRecordWriter('file.tfrecords')

# Three candidate encodings of the same kind of scalar:
bytes_val = np.array(1.1).tostring()
int_val = 1
float_val = 1.1

example = tf.train.Example(features=tf.train.Features(feature={'1': float_feature(float_val)}))
writer.write(example.SerializeToString())
writer.close()

for str_rec in tf.python_io.tf_record_iterator('file.tfrecords'):
    example = tf.train.Example()
    example.ParseFromString(str_rec)
    parsed_val = example.features.feature['1'].float_list.value[0]
    print(getsizeof(parsed_val))
For dtype float it will output 24 bytes, the lowest value. However, you can't pass an int to a tf.train.FloatList. An int dtype will occupy 28 bytes in this case, while bytes will be 41 bytes undecoded (before applying np.fromstring) and even more after.