 

How to bulk write TFRecords?

I have a CSV with approximately 40 million rows. Each row is a training instance. As per the documentation on consuming TFRecords, I am trying to encode and save the data in a TFRecord file.

All the examples I have found (even the ones in the TensorFlow repo) show that creating a TFRecord is dependent on the class TFRecordWriter. This class has a write method that takes a serialised string representation of the data and writes it to disk. However, this appears to be done one training instance at a time.
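For reference, the pattern in those examples looks roughly like this (a minimal sketch: the _*_feature helpers are the standard ones from the TensorFlow tutorial, and the filename is a placeholder):

  import tensorflow as tf

  # Standard feature helpers from the TensorFlow examples:
  def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))

  def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

  def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

  # One example serialised and written at a time:
  with tf.python_io.TFRecordWriter("train.tfrecord") as writer:
    example = tf.train.Example(features=tf.train.Features(
        feature={"label": _int64_feature([1])}))
    writer.write(example.SerializeToString())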

How do I write a batch of the serialised data?

Let's say I have a function:

  def write_row(sentiment, text, encoded):
    feature = {"one_hot": _float_feature(encoded),
               "label": _int64_feature([sentiment]),
               "text": _bytes_feature([text.encode()])}

    example = tf.train.Example(features=tf.train.Features(feature=feature))
    # writer is a TFRecordWriter opened in an enclosing scope
    writer.write(example.SerializeToString())

Writing to disk 40 million times (once per example) is going to be incredibly slow. It would be far more efficient to batch the data and write 50k or 100k examples at a time (as far as the machine's resources allow). However, there does not appear to be any method for this in TFRecordWriter.

Something along the lines of:

class MyRecordWriter:

  def __init__(self, writer):
    self.records = []
    self.counter = 0
    self.writer = writer

  def write_row_batched(self, sentiment, text, encoded):
    feature = {"one_hot": _float_feature(encoded),
               "label": _int64_feature([sentiment]),
               "text": _bytes_feature([text.encode()])}

    example = tf.train.Example(features=tf.train.Features(feature=feature))
    self.records.append(example.SerializeToString())
    self.counter += 1
    if self.counter >= 10000:
      # Join the buffered serialised examples and write them in one call
      self.writer.write(os.linesep.join(self.records))
      self.counter = 0
      self.records = []

But when reading the file created by this method I get the following error:

tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Could not parse example input, value: '...' (the value is binary garbage in which only the feature names "label" and "one_hot" are recognisable)

Note: I could change the encoding process so that each Example proto contains several thousand examples instead of just one, but I don't want to pre-batch the data when writing to the TFRecord file in this way, as it would introduce extra overhead in my training pipeline whenever I want to use the file for training with a different batch size.
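(With one example per record, the batch size stays a read-time decision. A minimal sketch of what I mean, assuming a tf.data pipeline and placeholder filename and batch size:)

  dataset = tf.data.TFRecordDataset(["train.tfrecord"])  # placeholder filename
  dataset = dataset.batch(128)  # batch size chosen at training time, not at write time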

Asked Feb 21 '18 by Insectatorious


1 Answer

TFRecord is a binary format, but with the following line you are treating it like a text file:

  self.writer.write(os.linesep.join(self.records))

os.linesep is the operating-system-dependent line separator (\n or \r\n, depending on the platform). Worse, each call to TFRecordWriter.write frames its argument as one length-delimited record, so joining 10,000 serialised protos into a single string produces one record whose payload cannot be parsed as an Example, which is exactly the error you are seeing.

Solution: just write the records one at a time. There is no need to batch the write calls yourself, because the writer is effectively a buffered writer already (see below). For 40 million rows you might also want to consider splitting the data up into separate files to allow better parallelisation.
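A minimal sketch of both points, assuming the _*_feature helpers from the question and a rows iterator over the CSV (the shard count and filenames are placeholders):

  import tensorflow as tf

  num_shards = 16  # placeholder: pick based on how many hosts will read the data
  writers = [tf.python_io.TFRecordWriter(
                 "train-%05d-of-%05d.tfrecord" % (i, num_shards))
             for i in range(num_shards)]

  for i, (sentiment, text, encoded) in enumerate(rows):
    feature = {"one_hot": _float_feature(encoded),
               "label": _int64_feature([sentiment]),
               "text": _bytes_feature([text.encode()])}
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    # One write() per example: this appends to an in-memory buffer,
    # not straight to disk, so it is much cheaper than it looks.
    writers[i % num_shards].write(example.SerializeToString())

  for writer in writers:
    writer.close()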

When using TFRecordWriter, the file is already buffered.

Evidence for that is found in the source:

  • tf_record.py calls pywrap_tensorflow.PyRecordWriter_New
  • PyRecordWriter calls Env::Default()->NewWritableFile
  • Env->NewWritableFile calls NewWritableFile on the matching FileSystem
  • e.g. PosixFileSystem calls fopen
  • fopen returns a stream which "is fully buffered by default if it is known to not refer to an interactive device"
  • That will be file system dependent but WritableFile notes "The implementation must provide buffering since callers may append small fragments at a time to the file."
Answered Sep 30 '22 by de1