I'm using MNIST and Keras to learn about CNNs. I'm downloading the MNIST database of handwritten digits through the Keras API as shown below. The dataset comes pre-split into 60,000 images for training and 10,000 images for testing (see Dataset - Keras Documentation).
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
How can I join the training and test sets and then separate them into 70% for training and 30% for testing?
There's no such argument in mnist.load_data. Instead, you can concatenate the data via numpy and then split it via sklearn (or numpy):
from keras.datasets import mnist
import numpy as np
from sklearn.model_selection import train_test_split

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# join the original train and test sets into one dataset
x = np.concatenate((x_train, x_test))
y = np.concatenate((y_train, y_test))

# re-split: 70% train / 30% test
train_size = 0.7
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=train_size, random_state=2019
)
Set a random seed (random_state) for reproducibility.
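As a quick sanity check (the shapes below assume the standard 70,000-sample MNIST and train_size=0.7):

print(x_train.shape, y_train.shape)  # (49000, 28, 28) (49000,)
print(x_test.shape, y_test.shape)    # (21000, 28, 28) (21000,)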
Via numpy (if you don't use sklearn):
# concatenate x and y exactly as above
np.random.seed(2019)
train_size = 0.7
index = np.random.rand(len(x)) < train_size  # boolean mask, True with probability 0.7
x_train, x_test = x[index], x[~index]  # the mask and its negation
y_train, y_test = y[index], y[~index]
You'll get arrays of approximately the required size (a test set of ~210xx rather than exactly 21000 samples), because each sample is assigned independently with probability 0.7.
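If you need the split sizes to be exact with plain numpy, one option (a sketch, not part of the original answer) is to shuffle the indices with np.random.permutation and slice:

import numpy as np

np.random.seed(2019)
perm = np.random.permutation(len(x))   # random ordering of all 70000 indices
split = int(len(x) * 0.7)              # exactly 49000 training samples
x_train, x_test = x[perm[:split]], x[perm[split:]]
y_train, y_test = y[perm[:split]], y[perm[split:]]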
Judging from the source code of mnist.load_data, the function just fetches the data from a URL where it's already split into 60000 training / 10000 test examples, so concatenating and re-splitting is the only workaround.
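For reference, mnist.load_data is essentially the following (paraphrased and simplified from the Keras source; the exact download URL, caching, and checksum handling vary between versions):

import numpy as np
from keras.utils import get_file  # import path may differ across Keras versions

def load_data(path="mnist.npz"):
    # downloads the pre-split .npz archive and caches it locally
    path = get_file(
        path,
        origin="https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz",
    )
    with np.load(path, allow_pickle=True) as f:
        x_train, y_train = f["x_train"], f["y_train"]
        x_test, y_test = f["x_test"], f["y_test"]
    return (x_train, y_train), (x_test, y_test)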
You could also download the MNIST dataset from http://yann.lecun.com/exdb/mnist/, preprocess it manually, and then concatenate and split it however you need (as sketched below). But, as far as I understand, it was divided into 60000 training examples and 10000 test examples because this split is used in standard benchmarks.
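If you go the manual route, the files on that page are gzipped IDX archives. A minimal parsing sketch, assuming the four .gz files (named as listed on that page) have already been downloaded to the working directory:

import gzip
import numpy as np

def read_idx_images(path):
    # IDX image format: 4-byte magic (2051), count, rows, cols, then raw pixels
    with gzip.open(path, "rb") as f:
        f.read(4)  # skip the magic number
        n = int.from_bytes(f.read(4), "big")
        rows = int.from_bytes(f.read(4), "big")
        cols = int.from_bytes(f.read(4), "big")
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(n, rows, cols)

def read_idx_labels(path):
    # IDX label format: 4-byte magic (2049), count, then raw labels
    with gzip.open(path, "rb") as f:
        f.read(8)  # skip the magic number and count
        return np.frombuffer(f.read(), dtype=np.uint8)

x = np.concatenate((read_idx_images("train-images-idx3-ubyte.gz"),
                    read_idx_images("t10k-images-idx3-ubyte.gz")))
y = np.concatenate((read_idx_labels("train-labels-idx1-ubyte.gz"),
                    read_idx_labels("t10k-labels-idx1-ubyte.gz")))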