 

When should one use tf.train.BytesList, tf.train.FloatList, and tf.train.Int64List for data to be stored in a tf.train.Feature?

TensorFlow provides 3 different formats for data to be stored in a tf.train.Feature. These are:

tf.train.BytesList
tf.train.FloatList
tf.train.Int64List

I often struggle to choose between tf.train.Int64List / tf.train.FloatList and tf.train.BytesList.

I see some examples online where they convert ints/floats into bytes and then store them in a tf.train.BytesList. Is this preferable to using one of the other formats? If so, why does TensorFlow even provide tf.train.Int64List and tf.train.FloatList as optional formats when you could just convert them to bytes and use tf.train.BytesList?

Thank you.

michael_question_answerer asked Mar 16 '19 20:03




1 Answer

Because a bytes list requires more memory. tf.train.BytesList is designed to store string data or, for example, a NumPy array converted to a single bytestring. Consider this example:

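import sys

import numpy as np
import tensorflow as tf

def int64_feature(value):
    # Int64List expects an iterable of ints, so wrap a scalar in a list.
    if not isinstance(value, list):
        value = [value]
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def float_feature(value):
    # FloatList expects an iterable of floats, so wrap a scalar in a list.
    if not isinstance(value, list):
        value = [value]
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

def bytes_feature(value):
    # BytesList expects an iterable of bytes objects.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

# Three encodings of the same scalar; the names avoid shadowing Python builtins.
bytes_value = np.array(1.1).tobytes()  # 8 raw bytes of a float64
int_value = 1
float_value = 1.1

# Write one record. Swap in int64_feature(int_value) or
# bytes_feature(bytes_value) to measure the other two cases.
# (TF 1.x spelled these calls tf.python_io.*; this version uses the TF 2.x names.)
example = tf.train.Example(features=tf.train.Features(feature={'1': float_feature(float_value)}))
with tf.io.TFRecordWriter('file.tfrecords') as writer:
    writer.write(example.SerializeToString())

# Read the record back and measure the parsed Python object.
for raw_record in tf.data.TFRecordDataset('file.tfrecords'):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    value = example.features.feature['1'].float_list.value[0]
    print(sys.getsizeof(value))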

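For the float value this prints 24 bytes, the smallest of the three. Note that getsizeof reports the in-memory size of the parsed Python object, not the size of the record on disk. A tf.train.FloatList is meant for floating-point values, however, so it is not the right container for an int. Stored as an int in a tf.train.Int64List, the value occupies 28 bytes, while the bytes version occupies 41 bytes undecoded (before it is converted back into a NumPy array) and even more after decoding.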
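The three sizes quoted above can be checked without TensorFlow, since they are just the sizes of the Python objects you get back after parsing. The following is a minimal sketch, not part of the original answer: the helper name object_sizes is hypothetical, and struct.pack("<d", ...) is used as a stand-in for the 8 raw bytes that np.array(1.1).tostring() produces on a little-endian machine. The exact numbers depend on the interpreter; 24, 28 and 41 are what a typical 64-bit CPython build reports.

```python
import struct
import sys


def object_sizes(number=1.1):
    """Return sys.getsizeof for the float, int, and bytes forms of a scalar.

    Hypothetical helper for illustration only; it is not a TensorFlow API.
    """
    as_float = float(number)
    as_int = int(number)
    # Assumption: 8 raw bytes of a 64-bit float, standing in for the
    # bytestring the answer builds with NumPy.
    as_bytes = struct.pack("<d", as_float)
    return {
        "float": sys.getsizeof(as_float),
        "int": sys.getsizeof(as_int),
        "bytes": sys.getsizeof(as_bytes),
    }


sizes = object_sizes()
for kind, size in sizes.items():
    print(f"{kind}: {size} bytes")
```

The bytes object is the largest because it carries its own object header on top of the 8-byte payload, which is the overhead the answer refers to.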

Sharky answered Nov 09 '22 17:11
