
Change size of train and test set from MNIST Dataset

I'm using MNIST and Keras to learn about CNNs. I'm downloading the MNIST database of handwritten digits through the Keras API as shown below. The dataset is already split into 60,000 images for training and 10,000 images for testing (see Dataset - Keras Documentation).

from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

How can I join the training and test sets and then separate them into 70% for training and 30% for testing?

Thulio Amorim asked Jan 22 '19 21:01



1 Answer

There's no such argument in mnist.load_data. Instead you can concatenate data via numpy then split via sklearn (or numpy):

from keras.datasets import mnist
import numpy as np
from sklearn.model_selection import train_test_split

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x = np.concatenate((x_train, x_test))
y = np.concatenate((y_train, y_test))

train_size = 0.7
# random_state (not random_seed) is the keyword train_test_split accepts
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=train_size, random_state=2019)

Set a random seed for reproducibility.

Via numpy (if you don't use sklearn):

# do the same concatenation as above to get x and y
np.random.seed(2019)
train_size = 0.7
index = np.random.rand(len(x)) < train_size  # boolean index, True with probability 0.7
x_train, x_test = x[index], x[~index]  # index and its negation
y_train, y_test = y[index], y[~index]

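You'll get arrays of approximately the required size (a test set of ~210xx samples instead of exactly 21000), because each sample is assigned to the training set independently with probability 0.7.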
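If the exact split size matters, one option is to shuffle the indices once and cut them at a fixed position instead of drawing a random mask. The following is only a sketch: exact_split is a hypothetical helper name, and the small zero-filled arrays stand in for the concatenated MNIST x and y so the snippet runs without downloading anything.

```python
import numpy as np


def exact_split(x, y, train_size=0.7, seed=2019):
    """Shuffle indices once and cut at a fixed position (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_train = int(round(train_size * len(x)))  # exact number of training samples
    perm = rng.permutation(len(x))             # each index appears exactly once
    train_idx, test_idx = perm[:n_train], perm[n_train:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]


# Stand-in data (assumption): 1000 fake 28x28 images instead of the 70000 MNIST ones.
x = np.zeros((1000, 28, 28), dtype=np.uint8)
y = np.arange(1000) % 10

x_train, x_test, y_train, y_test = exact_split(x, y, train_size=0.7)
print(x_train.shape, x_test.shape)
```

With the real concatenated arrays (70000 samples) and train_size = 0.7, the same cut gives 49000 training and 21000 test samples.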

Judging by its source code, mnist.load_data just fetches the data from a URL, already split into 60000 training / 10000 test examples, so concatenation is the only workaround.

You could also download the MNIST dataset from http://yann.lecun.com/exdb/mnist/, preprocess it manually, and then concatenate it as needed. But, as far as I understand, it was divided into 60000 examples for training and 10000 for testing because this split is used in standard benchmarks.
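For the manual route, here is a rough sketch of reading one file, assuming it uses the IDX layout described on that page (two zero bytes, a type code, the number of dimensions, then big-endian dimension sizes followed by the raw values). parse_idx is a hypothetical helper name, it handles only unsigned-byte data, and the bytes below are synthetic rather than a real download.

```python
import struct

import numpy as np


def parse_idx(raw: bytes) -> np.ndarray:
    """Decode an uncompressed IDX buffer of unsigned bytes (hypothetical helper)."""
    # Header: two zero bytes, one type code, one byte holding the number of dimensions.
    zeros, type_code, ndim = struct.unpack(">HBB", raw[:4])
    if zeros != 0 or type_code != 0x08:
        raise ValueError("not an unsigned-byte IDX buffer")
    # Each dimension size is a big-endian unsigned 32-bit integer.
    shape = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    data = np.frombuffer(raw, dtype=np.uint8, offset=4 + 4 * ndim)
    return data.reshape(shape)


# Synthetic buffer (assumption): two fake 2x2 "images" instead of a real MNIST file.
fake = struct.pack(">HBB", 0, 0x08, 3) + struct.pack(">III", 2, 2, 2) + bytes(range(8))
images = parse_idx(fake)
print(images.shape)
```

The downloaded files are gzip-compressed, so they would need to be decompressed first (for example with gzip.open(path, "rb").read(), where path is wherever you saved the file). The resulting arrays can then be joined with np.concatenate as shown earlier.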

Mikhail Stepanov answered Sep 22 '22 01:09
