
How to create multiple TFRecord files instead of making a big one and then splitting it up?

I'm dealing with a quite big time series dataset that, prepared as SequenceExamples, is then written to a TFRecord. This results in a quite large file (over 100 GB), but I'd like to have it stored in chunks. I've tried:

import os
import tensorflow as tf

file = '/path/to/tf_record_0.tfrecords'
file_index = 0

for record in dataset:
    # fill the time series window, prepare the sequence_example, etc.

    if os.path.exists(file) and os.path.getsize(file) > 123456789:
        file = file.replace(str(file_index), str(file_index + 1))
        file_index += 1

    with tf.io.TFRecordWriter(file) as writer:
        writer.write(sequence_example.SerializeToString())

...but since TFRecordWriter opens files like Python's open(file, mode='w'), the file is overwritten every time the with block is entered (apart from this being a really ugly solution), and from what I've read there's no way to change that behavior. Changing the file path inside the with block obviously throws an error.

So my question is, is there a way to create the next TFRecord file once the current one reaches a certain size while looping over and working with my dataset? And is there any benefit to having smaller TFRecord files anyway, when I'm not dealing with any bottleneck apart from a lack of system memory? If I'm correct, TensorFlow can read the file from disk without issues (although there might be other reasons one would prefer multiple files anyway).

One thing I can think of is creating some sort of buffer in a list for ready-to-be-saved sequences and creating/saving a TFRecord once that buffer reaches some threshold, roughly like the sketch below.
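A rough sketch of that idea (assuming the same dataset and sequence_example variables as above; here I count buffered examples rather than bytes, and the threshold and paths are placeholders):

import tensorflow as tf

buffer = []
file_index = 0
BUFFER_SIZE = 10000  # flush threshold, placeholder value

for record in dataset:
    # fill the time series window, prepare the sequence_example, etc.
    buffer.append(sequence_example.SerializeToString())

    # once the buffer is full, write it into its own shard and start a new one
    if len(buffer) >= BUFFER_SIZE:
        with tf.io.TFRecordWriter('/path/to/tf_record_{}.tfrecords'.format(file_index)) as writer:
            for serialized in buffer:
                writer.write(serialized)
        buffer = []
        file_index += 1

# flush whatever is left over
if buffer:
    with tf.io.TFRecordWriter('/path/to/tf_record_{}.tfrecords'.format(file_index)) as writer:
        for serialized in buffer:
            writer.write(serialized)

Since each shard file is opened exactly once, the with block no longer overwrites anything.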

Coldark asked Nov 06 '22

1 Answer

I'm using a data generator to create my TFRecords. In my case, I followed almost the same approach as suggested by the previous answer. Here is my code:

import tensorflow as tf

tfrecord_filename = 'test_record_{}.tfrecords'
base_path = 'image_path'

for index in range(2):  # number of splits
    writer = tf.data.experimental.TFRecordWriter(tfrecord_filename.format(index))

    serialized_features_dataset = tf.data.Dataset.from_generator(
        generator, output_types=tf.string, output_shapes=(),
        args=(index,))  # pass the split index to the generator

    writer.write(serialized_features_dataset)
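Each pass through the loop writes one shard: tf.data.experimental.TFRecordWriter.write consumes the dataset it is given, so every serialized string the generator produces for that split ends up in its own file.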

I added the following code to the generator:

import glob
import os

def generator(index):  # receives the split index through args above
    for folder in os.listdir(base_path):
        images = glob.glob(os.path.join(base_path, folder, '/*.jpg'.lstrip('/')))

        partition_factor = len(images) // 2  # split number
        partition_images = images[int(partition_factor * index):int(partition_factor * (index + 1))]

        for image in partition_images:
            yield serialize_images(image, folder)  # this method is where I parse the images into records
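For reference, a minimal sketch of what serialize_images might look like, assuming it stores the raw JPEG bytes and the folder name as a label (the feature names here are placeholder assumptions, not code from the answer):

def serialize_images(image_path, label):
    # Hypothetical example: wrap the raw JPEG bytes and the folder name
    # (used as a label) in a tf.train.Example and serialize it.
    with open(image_path, 'rb') as f:
        image_bytes = f.read()

    feature = {
        'image_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        'label': tf.train.Feature(bytes_list=tf.train.BytesList(value=[str(label).encode('utf-8')])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()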
Alvaro Leandro Cavalcante answered Nov 14 '22