I think the title is self-explanatory, but to ask it in detail: sklearn has the method train_test_split(),
which works like this: X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, stratify = Y)
It means the method will split the data in a 30:70 test/train ratio and will try to keep the percentage of each label the same in both sets. Is there a Keras equivalent of this?
The simplest way to split the modelling dataset into training and testing sets is to assign two-thirds of the data points to the former and the remaining one-third to the latter. We then train the model on the training set and apply it to the test set; in this way we can evaluate the model's performance.
The original data for a machine learning model is typically split into three or four sets. The three commonly used sets are the training set, the dev set and the test set: the training set is the portion of data used to train the model.
If the size of the dataset is between 100 and 1,000,000 records, a common split is 60:20:20, i.e. 60% of the data goes to the training set, 20% to the dev set and the remaining 20% to the test set. The main aim when choosing the split ratio is that all three sets should reflect the general trend of the original dataset; a sketch of such a split follows.
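As a minimal sketch, one way to get a 60/20/20 split is to call sklearn's train_test_split twice (the variable names here are illustrative, not from the question):

from sklearn.model_selection import train_test_split

# First hold out 20% of the data as the test set.
X_trainval, X_test, Y_trainval, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y)
# Then split the remaining 80% so that 0.25 of it (20% of the original) becomes the dev set.
X_train, X_dev, Y_train, Y_dev = train_test_split(X_trainval, Y_trainval, test_size=0.25, stratify=Y_trainval)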
One option is to use the tf.data Dataset class. I'm running keras 2.2.4-tf along with the new TensorFlow release.
Basically, load all the data into a Dataset using something like tf.data.Dataset.from_tensor_slices.
Then split the data into new datasets for training and validation. For example, shuffle all the records in the dataset, then use all but the first 400 as training and the first 400 as validation.
ds = ds_in.shuffle(buffer_size=rec_count)  # shuffle the full dataset
ds_train = ds.skip(400)                    # all records after the first 400
ds_validate = ds.take(400)                 # the first 400 records
An instance of the Dataset class is a natural container to pass around to Keras models. I copied the concept from a TensorFlow or Keras training example but can't seem to find it again.
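As a rough sketch of what that looks like when fitting (the model architecture, batch size and epoch count below are placeholders, not part of the original example):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# model.fit accepts tf.data Datasets directly; batch them first.
model.fit(ds_train.batch(32), validation_data=ds_validate.batch(32), epochs=5)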
The canned datasets' load_data methods return numpy.ndarray objects, so they are a little different, but they can easily be converted to a tf.data Dataset. I suspect this hasn't been done in the library itself because so much existing code would break.
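For example, a minimal sketch of converting the MNIST arrays returned by load_data into a Dataset (the 400-record validation split is just carried over from above):

import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Wrap the numpy arrays in a Dataset, then reuse the shuffle/skip/take split from above.
ds_in = tf.data.Dataset.from_tensor_slices((x_train, y_train))
rec_count = len(x_train)
ds = ds_in.shuffle(buffer_size=rec_count)
ds_train = ds.skip(400)
ds_validate = ds.take(400)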