Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a keras method to split data?

I think the title is self explanatory but to ask it in details, there's sklearn's method train_test_split() which works like: X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, stratify = Y) It means: the method will split data with 0.3 : 0.7 ratio and will try to make percentage of labels in both data equal. Is there a keras equivalent of this?

like image 979
CerushDope Avatar asked Feb 01 '18 15:02

CerushDope


People also ask

How do I split a dataset into two?

The simplest way to split the modelling dataset into training and testing sets is to assign 2/3 data points to the former and the remaining one-third to the latter. Therefore, we train the model using the training set and then apply the model to the test set. In this way, we can evaluate the performance of our model.

Which method we use to split the data in machine learning?

The original data in a machine learning model is typically taken and split into three or four sets. The three sets commonly used are the training set, the dev set and the testing set: The training set is the portion of data used to train the model.

How do you split data for deep learning?

If the size of our dataset is between 100 to 10,00,000, then we split it in the ratio 60:20:20. That is 60% data will go to the Training Set, 20% to the Dev Set and remaining to the Test Set. The main aim of deciding the splitting ratio is that all three sets should have the general trend of our original dataset.


1 Answers

Now there is using the keras Dataset class. I'm running keras-2.2.4-tf along with the new tensorflow release.

Basically, load all the data into a Dataset using something like tf.data.Dataset.from_tensor_slices. Then split the data into new datasets for training and validation. For example, shuffle all the records in the dataset. Then use all but the first 400 as training and the first 400 as validation.

ds = ds_in.shuffle(buffer_size=rec_count)
ds_train = ds.skip(400)
ds_validate = ds.take(400)

An instance of the Dataset class is a natural container to pass around for the Keras models. I copied the concept from a tensorflow or keras training example but can't seem to find it again.

The canned datasets using the load_data method create numpy.ndarray classes so they are a little different but can be easily converted to a keras Dataset. I suspect this hasn't been done because so much existing code would break.

like image 169
dturvene Avatar answered Sep 27 '22 21:09

dturvene