Split train data to train and validation by using tensorflow_datasets.load (TF 2.1)

I'm trying to run the following Colab project, but when I try to split the training data into training and validation sets, I get this error:

KeyError: "Invalid split train[:70%]. Available splits are: ['train']"

I use the following code:

(training_set, validation_set), dataset_info = tfds.load(
    'tf_flowers',
    split=['train[:70%]', 'train[70%:]'],
    with_info=True,
    as_supervised=True,
)

How can I fix this error?

Asked by Pouya Ahmadvand, Jan 25 '20 at 02:01



1 Answer

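According to the TensorFlow Datasets docs, the approach you presented is now supported. Splitting is possible by passing the split parameter to tfds.load, for example split="train[:70%]". Note that tf_flowers only provides a 'train' split, so both parts have to be sliced from it: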

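import tensorflow_datasets as tfds

# First 70% of 'train' for training, remaining 30% for validation.
(training_set, validation_set), dataset_info = tfds.load(
    'tf_flowers',
    split=['train[:70%]', 'train[70%:]'],
    with_info=True,
    as_supervised=True,
)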

With the above code, training_set has 2569 entries and validation_set has 1101.
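
The counts above can be sanity-checked with a short sketch. It assumes the 'train' split holds 3670 examples (the sum of the two counts quoted above) and that percent boundaries are rounded to the nearest example, which is an assumption about TFDS behaviour rather than something stated in this answer. The helper names are hypothetical; load_and_check is only defined, not called, because it needs tensorflow_datasets and downloads the dataset.

```python
def expected_split_sizes(total, percent):
    """Sizes of the [:percent%] and [percent%:] slices of `total` examples.

    Assumption: the boundary is rounded to the nearest example.
    """
    boundary = (total * percent + 50) // 100  # integer round-half-up
    return boundary, total - boundary


def count_examples(dataset):
    """Count elements by iterating; works for any iterable, including tf.data.Dataset."""
    return sum(1 for _ in dataset)


def load_and_check(percent=70):
    """Load the two slices and return (expected_sizes, actual_sizes)."""
    import tensorflow_datasets as tfds  # imported here so the helpers above work without it

    (training_set, validation_set), dataset_info = tfds.load(
        'tf_flowers',
        split=[f'train[:{percent}%]', f'train[{percent}%:]'],
        with_info=True,
        as_supervised=True,
    )
    total = dataset_info.splits['train'].num_examples
    expected = expected_split_sizes(total, percent)
    actual = (count_examples(training_set), count_examples(validation_set))
    return expected, actual


print(expected_split_sizes(3670, 70))  # (2569, 1101)
```

Counting by iteration reads every example once, which is fine for a dataset of this size.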

Thanks to Saman for the comment on API deprecation:
In previous TensorFlow Datasets versions it was possible to use the tfds.Split.TRAIN.subsplit API, which is now deprecated:

(training_set, validation_set), dataset_info = tfds.load(
    'tf_flowers',
    split=[
        tfds.Split.TRAIN.subsplit(tfds.percent[:70]),
        tfds.Split.TRAIN.subsplit(tfds.percent[70:])
    ],
    with_info=True,
    as_supervised=True,
)
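
Which of the two forms works depends on the installed tensorflow_datasets release, so it may help to check the version before choosing one. The sketch below uses only the standard library; the helper name installed_version is hypothetical, and the idea that the KeyError in the question comes from an older install is an assumption, not something confirmed here.

```python
from importlib import metadata


def installed_version(dist_name="tensorflow-datasets"):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("tensorflow-datasets is not installed")
    else:
        print("tensorflow-datasets", version)
```

If the reported version is old, upgrading with pip install -U tensorflow-datasets and restarting the Colab runtime is a reasonable first thing to try.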

Answered by sebastian-sz, Nov 14 '22 at 20:11