I'm dealing with a quite big time series dataset, one that is prepared as SequenceExamples and then written to a TFRecord. This results in a quite large file (over 100GB), but I'd like to have it stored in chunks. I've tried:
file = '/path/to/tf_record_0.tfrecords'
file_index = 0

for record in dataset:
    # fill the time series window, prepare the sequence_example, etc.
    if os.path.exists(file) and os.path.getsize(file) > 123456789:
        file = file.replace(str(file_index), str(file_index + 1))
        file_index += 1
    with tf.io.TFRecordWriter(file) as writer:
        writer.write(sequence_example.SerializeToString())
...but since TFRecordWriter opens files like Python's open(file, mode='w'), it overwrites itself every time it enters the with block (apart from it being a really ugly solution), and from what I've read there's no way to change that behavior. Changing the file path inside the with block obviously throws an error.
So my question is: is there a way to create the next TFRecord file when the current one reaches a certain size while looping over and working with my dataset? And is there a benefit to having smaller TFRecord files anyway, when I'm not dealing with any bottleneck apart from lack of system memory? If I'm correct, TensorFlow can read a large file from disk without issues (although there might be other reasons one would prefer to have multiple files anyway).
One thing I can think of is creating some sort of buffer in a list for ready-to-be-saved sequences, and creating/saving a TFRecord once that buffer reaches some threshold, as sketched below.
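A minimal sketch of that buffered approach (my own assumption, not tested against the 100GB dataset): serialized_examples stands in for any iterable of already-serialized SequenceExample protos, and the shard boundary is a record count rather than a byte size:

import tensorflow as tf

RECORDS_PER_SHARD = 10000  # assumed threshold; a byte-size check would also work

def write_in_shards(serialized_examples, path_template='/path/to/tf_record_{}.tfrecords'):
    """Buffer serialized examples and start a new TFRecord file per shard."""
    buffer, shard_index = [], 0
    for example in serialized_examples:
        buffer.append(example)
        if len(buffer) >= RECORDS_PER_SHARD:
            _flush(buffer, path_template.format(shard_index))
            buffer, shard_index = [], shard_index + 1
    if buffer:  # write any leftover records
        _flush(buffer, path_template.format(shard_index))

def _flush(buffer, path):
    # Each shard file is opened exactly once, so nothing gets overwritten.
    with tf.io.TFRecordWriter(path) as writer:
        for example in buffer:
            writer.write(example)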
I'm using a data generator to create my TFRecords. In my case I followed almost the same approach as suggested by the previous answer; here is my code:
import glob
import os
import tensorflow as tf

tfrecord_filename = 'test_record_{}.tfrecords'
base_path = 'image_path'

for index in range(2):  # number of splits
    writer = tf.data.experimental.TFRecordWriter(tfrecord_filename.format(index))
    serialized_features_dataset = tf.data.Dataset.from_generator(
        generator, output_types=tf.string, output_shapes=())
    writer.write(serialized_features_dataset)
I added the following code to the generator:
def generator():
    for folder in os.listdir(base_path):
        images = glob.glob(base_path + folder + '/*.jpg')
        partition_factor = len(images) // 2  # split number
        partition_images = images[int(partition_factor * index):int(partition_factor * (index + 1))]
        for image in partition_images:
            yield serialize_images(image, folder)  # this method is where I parse the images into records
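For completeness, a minimal sketch of reading the resulting shards back (assuming the 'test_record_{}.tfrecords' naming above); this is also where multiple smaller files pay off, since tf.data can read them in parallel:

import tensorflow as tf

# Match all shards written above (assumes the 'test_record_*.tfrecords' pattern).
filenames = tf.data.Dataset.list_files('test_record_*.tfrecords')

# Interleave reads across shards; several smaller files allow parallel I/O.
dataset = filenames.interleave(
    tf.data.TFRecordDataset,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)

for raw_record in dataset.take(1):
    print(raw_record)  # serialized bytes; parse with tf.io.parse_single_sequence_example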