I am new to TensorFlow and have started using TensorFlow 2.0.
I have built a TensorFlow dataset for a multi-class classification problem. Let's call it labeled_ds. I prepared this dataset by loading all the image files from their respective class-wise directories, following the tutorial here: TensorFlow guide to load image dataset.
Now I need to split labeled_ds into three disjoint pieces: train, validation, and test. I went through the TensorFlow API but found no example that lets me specify the split percentages. I found something in the load method, but I am not sure how to use it. Further, how can I make the splits stratified?
# labeled_ds contains multi class data, which is unbalanced.
train_ds, val_ds, test_ds = tf.data.Dataset.tfds.load(labeled_ds, split=["train", "validation", "test"])
I am stuck here and would appreciate any advice on how to proceed. Thanks in advance.
Split the dataset: we can use scikit-learn's train_test_split to first split the original dataset into a training set and a test set. Applying the same function again to the training set then yields the validation set. The test set size is given as the fraction of the original data we want to hold out for testing.
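The two-stage idea can be sketched in plain Python, with stratification so that each class keeps its proportion in every piece. This is a minimal illustration, not scikit-learn's implementation: stratified_split, its parameters, and the toy file names are all hypothetical, and it assumes the data is available as parallel lists of items and labels.

```python
import random
from collections import defaultdict


def stratified_split(items, labels, holdout_ratio, seed=0):
    """Split (item, label) pairs into (kept, held_out), class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)

    kept, held_out = [], []
    for label, members in by_class.items():
        members = list(members)
        rng.shuffle(members)
        # Hold out the same fraction of every class.
        n_held = round(holdout_ratio * len(members))
        held_out += [(m, label) for m in members[:n_held]]
        kept += [(m, label) for m in members[n_held:]]
    return kept, held_out


# Hypothetical unbalanced toy data: 80 "cat" files and 20 "dog" files.
paths = [f"img_{i}.jpg" for i in range(100)]
labels = ["cat"] * 80 + ["dog"] * 20

# Stage 1: hold out 10% of the original data as the test set.
train_val, test = stratified_split(paths, labels, holdout_ratio=0.1)

# Stage 2: hold out 1/9 of the remainder, which is 10% of the original.
tv_paths, tv_labels = zip(*train_val)
train, val = stratified_split(tv_paths, tv_labels, holdout_ratio=1 / 9)

print(len(train), len(val), len(test))
```

If the items are image file paths, each resulting list could then be turned into its own dataset, for example with tf.data.Dataset.from_tensor_slices, before applying the image-loading function.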
A single train/test split has caveats. To mitigate them, we can use cross-validation. It is very similar to a train/test split, but it is applied to more subsets: we split the data into k subsets, train on k-1 of them, and evaluate on the remaining one.
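As a rough sketch of how the k subsets can be formed, the generator below yields index lists for each fold. The name k_fold_indices is hypothetical, and the sketch assumes the data has already been shuffled, since it cuts the folds in order.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


folds = list(k_fold_indices(10, 5))
for train_idx, val_idx in folds:
    print(val_idx, "held out; training on", len(train_idx), "samples")
```

Each sample is held out exactly once, so averaging the k validation scores uses all of the data for evaluation.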
The safest way to split the data into these three sets is to keep one directory for train, one for dev, and one for test. For instance, with a dataset of images, you could place 80% of the files in the training directory, 10% in the dev directory, and 10% in the test directory.
The train-test split is used to estimate the performance of machine learning algorithms when they make predictions on data not used for training. It is a fast and easy procedure, and its results let us compare the performance of our own machine learning models.
The code below creates train, validation, and test splits from the TensorFlow Datasets dataset "oxford_flowers102":
# Notebook shell command; run once to install TensorFlow 2.0.
!pip install tensorflow==2.0.0
import tensorflow as tf
import tensorflow_datasets as tfds
print(tf.__version__)
# Load all three predefined splits as a single dataset.
labeled_ds, summary = tfds.load('oxford_flowers102', split='train+test+validation', with_info=True)
# Count the examples by iterating over the dataset once.
labeled_all_length = sum(1 for _ in labeled_ds)
# 80% train, 10% test, and the remainder (about 10%) validation.
train_size = int(0.8 * labeled_all_length)
val_test_size = int(0.1 * labeled_all_length)
df_train = labeled_ds.take(train_size)
# Everything after the training examples is shared between test and validation.
df_test = labeled_ds.skip(train_size)
df_val = df_test.skip(val_test_size)
df_test = df_test.take(val_test_size)
# Verify the size of each split.
df_train_length = sum(1 for _ in df_train)
df_val_length = sum(1 for _ in df_val)
df_test_length = sum(1 for _ in df_test)
print('Original: ', labeled_all_length)
print('Train: ', df_train_length)
print('Validation :', df_val_length)
print('Test :', df_test_length)
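Note that take and skip only produce disjoint pieces if the dataset yields its elements in the same order every time it is iterated. If you want to shuffle before splitting, a minimal sketch is shown below. It assumes the whole dataset fits in the shuffle buffer; split_dataset and its parameters are hypothetical names, and a small range dataset stands in for the real data.

```python
import tensorflow as tf


def split_dataset(ds, n_items, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle once in a fixed order, then cut into train/val/test."""
    # reshuffle_each_iteration=False keeps the order identical on every
    # pass, so the take/skip pieces below never overlap.
    ds = ds.shuffle(n_items, seed=seed, reshuffle_each_iteration=False)
    n_train = int(train_frac * n_items)
    n_val = int(val_frac * n_items)
    train = ds.take(n_train)
    rest = ds.skip(n_train)
    val = rest.take(n_val)
    test = rest.skip(n_val)
    return train, val, test


# Stand-in data: the integers 0..99.
demo_ds = tf.data.Dataset.range(100)
train_ds, val_ds, test_ds = split_dataset(demo_ds, n_items=100)

train_ids = {int(x.numpy()) for x in train_ds}
val_ids = {int(x.numpy()) for x in val_ds}
test_ids = {int(x.numpy()) for x in test_ds}
print(len(train_ids), len(val_ids), len(test_ids))
```

This split is random but not stratified; for unbalanced classes the per-class proportions can still drift between the pieces.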
I had the same problem.
It depends on the dataset; most of them come with a train and a test set. In that case you can do the following (assuming an 80-10-10 split):
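# A slice applies to a single split, so slice train and test separately and join them with '+'.
(train_ds, val_ds, test_ds), info = tfds.load('fashion_mnist', with_info=True, as_supervised=True,
split=['train[:80%]+test[:80%]', 'train[80%:90%]+test[80%:90%]', 'train[90%:]+test[90%:]'],
data_dir=filePath)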