How do I split Tensorflow datasets?

2 Answers

You may use Dataset.take() and Dataset.skip():

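import tensorflow as tf

# DATASET_SIZE (total number of records) and FLAGS.input_file are assumed to be defined elsewhere.
train_size = int(0.7 * DATASET_SIZE)
val_size = int(0.15 * DATASET_SIZE)
test_size = int(0.15 * DATASET_SIZE)

full_dataset = tf.data.TFRecordDataset(FLAGS.input_file)
# shuffle() requires a buffer size; reshuffle_each_iteration=False keeps the order
# fixed across iterations, so the splits below cannot overlap.
full_dataset = full_dataset.shuffle(buffer_size=DATASET_SIZE, reshuffle_each_iteration=False)
train_dataset = full_dataset.take(train_size)
test_dataset = full_dataset.skip(train_size)
val_dataset = test_dataset.skip(test_size)
test_dataset = test_dataset.take(test_size)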

For more generality, I gave an example using a 70/15/15 train/val/test split but if you don't need a test or a val set, just ignore the last 2 lines.
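One detail worth checking is how the sizes round. The sketch below is plain Python, not TensorFlow: the `take` and `skip` helpers are hypothetical stand-ins that mimic the dataset methods on a list, and the size 1001 is an assumption chosen so the fractions do not divide evenly. It suggests that the three `int(...)` sizes need not add up to the dataset size, and that the validation set, being "everything left after train and test", absorbs the remainder.

```python
from itertools import islice

DATASET_SIZE = 1001  # hypothetical size that does not divide evenly

train_size = int(0.7 * DATASET_SIZE)
val_size = int(0.15 * DATASET_SIZE)
test_size = int(0.15 * DATASET_SIZE)

records = list(range(DATASET_SIZE))  # stand-in for the real records


def take(items, n):
    """List-based stand-in for Dataset.take(n): at most the first n items."""
    return list(islice(items, n))


def skip(items, n):
    """List-based stand-in for Dataset.skip(n): everything after the first n items."""
    return list(islice(items, n, None))


# Same order of operations as the snippet above.
train = take(records, train_size)
rest = skip(records, train_size)
val = skip(rest, test_size)
test = take(rest, test_size)

print(train_size, val_size, test_size)   # computed sizes
print(len(train), len(val), len(test))   # actual split sizes
```

Note that `val_size` is computed but never passed to `take()` or `skip()`, so the validation split can end up slightly larger than 15%.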

Take:

Creates a Dataset with at most count elements from this dataset.

Skip:

Creates a Dataset that skips count elements from this dataset.

You may also want to look into Dataset.shard():

Creates a Dataset that includes only 1/num_shards of this dataset.
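As a small illustration of `shard()`, here is a sketch on a toy dataset. The use of `tf.data.Dataset.range(8)` and the choice of 4 shards are assumptions made only for the example; `shard(num_shards, index)` keeps every `num_shards`-th element, starting at position `index`.

```python
import tensorflow as tf

# Toy dataset 0..7, used only for illustration.
dataset = tf.data.Dataset.range(8)

# Each shard keeps the elements whose position modulo num_shards equals index.
shard_0 = [int(v) for v in dataset.shard(num_shards=4, index=0).as_numpy_iterator()]
shard_1 = [int(v) for v in dataset.shard(num_shards=4, index=1).as_numpy_iterator()]

print(shard_0)  # [0, 4]
print(shard_1)  # [1, 5]
```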

186

answered Sep 22 '22 20:09

ted

This question is similar to this one and this one, and I am afraid we have not had a satisfactory answer yet.

Using take() and skip() requires knowing the dataset size. What if I don't know that, or don't want to find out?
Using shard() only gives 1 / num_shards of the dataset. What if I want the rest?

I try to present a better solution below, tested on TensorFlow 2 only. Assuming you already have a shuffled dataset, you can then use filter() to split it into two:

import tensorflow as tf

all = tf.data.Dataset.from_tensor_slices(list(range(1, 21))) \
        .shuffle(10, reshuffle_each_iteration=False)

test_dataset = all.enumerate() \
                    .filter(lambda x,y: x % 4 == 0) \
                    .map(lambda x,y: y)

train_dataset = all.enumerate() \
                    .filter(lambda x,y: x % 4 != 0) \
                    .map(lambda x,y: y)

for i in test_dataset:
    print(i)

print()

for i in train_dataset:
    print(i)

The parameter reshuffle_each_iteration=False is important. It makes sure the original dataset is shuffled once and no more. Otherwise, the two resulting sets may have some overlaps.

Use enumerate() to add an index.

Use filter(lambda x,y: x % 4 == 0) to take 1 sample out of 4. Likewise, x % 4 != 0 takes 3 out of 4.

Use map(lambda x,y: y) to strip the index and recover the original sample.

This example achieves a 75/25 split.

x % 5 == 0 and x % 5 != 0 give an 80/20 split.

If you really want a 70/30 split, x % 10 < 3 and x % 10 >= 3 should do.
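The split ratios quoted above can be sanity-checked without TensorFlow. This is a plain-Python sketch; `kept_fraction` is a hypothetical helper, and the range of 1000 indices is an arbitrary choice that divides evenly by 4, 5 and 10. It counts how many indices each test-set predicate selects.

```python
def kept_fraction(predicate, n=1000):
    """Fraction of the indices 0..n-1 for which predicate(index) is true."""
    return sum(1 for i in range(n) if predicate(i)) / n


print(kept_fraction(lambda i: i % 4 == 0))   # 0.25 -> 75/25 split
print(kept_fraction(lambda i: i % 5 == 0))   # 0.2  -> 80/20 split
print(kept_fraction(lambda i: i % 10 < 3))   # 0.3  -> 70/30 split
```

On a dataset whose length is not a multiple of the modulus, the realised fractions will deviate slightly from these values.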

UPDATE:

As of TensorFlow 2.0.0, the code above may produce some warnings due to AutoGraph's limitations. To eliminate those warnings, declare all lambda functions separately:

def is_test(x, y):
    return x % 4 == 0

def is_train(x, y):
    return not is_test(x, y)

recover = lambda x,y: y

test_dataset = all.enumerate() \
                    .filter(is_test) \
                    .map(recover)

train_dataset = all.enumerate() \
                    .filter(is_train) \
                    .map(recover)

This gives no warning on my machine. Defining is_train() as the negation of is_test() is also good practice: every sample then lands in exactly one of the two sets.
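To check that the two filters really partition the data, here is a minimal sketch on an unshuffled toy dataset. Two things here are my own substitutions rather than part of the answer: `tf.math.logical_not` is used in place of Python's `not`, so the negation does not rely on AutoGraph converting it, and `drop_index` is a hypothetical named function standing in for the `recover` lambda. The values 100..119 are arbitrary stand-in data.

```python
import tensorflow as tf


def is_test(index, sample):
    # Every 4th element (by position) goes to the test set.
    return index % 4 == 0


def is_train(index, sample):
    # Explicit tensor negation of is_test (a substitution for `not`).
    return tf.math.logical_not(is_test(index, sample))


def drop_index(index, sample):
    # Strip the index added by enumerate().
    return sample


# Unshuffled toy dataset with 20 distinct values.
dataset = tf.data.Dataset.range(100, 120)

test_dataset = dataset.enumerate().filter(is_test).map(drop_index)
train_dataset = dataset.enumerate().filter(is_train).map(drop_index)

test_items = {int(v) for v in test_dataset.as_numpy_iterator()}
train_items = {int(v) for v in train_dataset.as_numpy_iterator()}

print(len(test_items), len(train_items))              # sizes of the two splits
print(test_items.isdisjoint(train_items))             # no sample in both sets
print(test_items | train_items == set(range(100, 120)))  # nothing lost
```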

answered Sep 23 '22 20:09

Nick Lee


Tags:

tensorflow

tensorflow-datasets

Lukas Hestermeyer
