Loading the EMNIST-letters dataset

I have been trying to find a way to load the EMNIST-letters dataset, but without much success. I have found some interesting things in the structure, but I can't wrap my head around what is happening. Here is what I mean:

I downloaded the .mat format from here.

I can load the data using:

import scipy.io
mat = scipy.io.loadmat('letter_data.mat')  # renamed for convenience

It is a dictionary with the following keys:

dict_keys(['__header__', '__version__', '__globals__', 'dataset'])

The only key of interest is dataset, which I haven't been able to gather data from. Printing its shape gives this:

>>>print(mat['dataset'].shape)
(1, 1)

I dug deeper and deeper to find a shape that looks somewhat like a real dataset and came across this:

>>>print(mat['dataset'][0][0][0][0][0][0].shape)
(124800, 784)

which is exactly what I wanted, but I can't find the labels or the test data. I have tried many things but can't seem to understand the structure of this dataset.

If someone could tell me what is going on with this, I would appreciate it.

asked Jul 01 '18 by Tissuebox


4 Answers

Because of the way the dataset is structured, the array of image arrays can be accessed with mat['dataset'][0][0][0][0][0][0] and the array of label arrays with mat['dataset'][0][0][0][0][0][1]. For instance, print(mat['dataset'][0][0][0][0][0][0][0]) will print out the pixel values of the first image, and print(mat['dataset'][0][0][0][0][0][1][0]) will print the first image's label.
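As a quick sketch of that access pattern (using the file name from the question):

    import scipy.io

    mat = scipy.io.loadmat('letter_data.mat')

    # Drill down through the nested struct to the image and label arrays described above.
    images = mat['dataset'][0][0][0][0][0][0]
    labels = mat['dataset'][0][0][0][0][0][1]

    print(images.shape)  # (124800, 784) -- one flattened 28x28 image per row
    print(labels.shape)  # one label per image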

For a less...convoluted dataset, I'd actually recommend using the CSV version of the EMNIST dataset on Kaggle: https://www.kaggle.com/crawford/emnist. Each row is a separate image, and there are 785 columns: the first column is the class label, and each of the remaining 784 columns is one pixel value (for a 28 x 28 image).
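If you go the CSV route, a minimal loading sketch (assuming a file name like emnist-letters-train.csv from that Kaggle page and no header row; adjust to whatever you downloaded) could look like:

    import numpy as np
    import pandas as pd

    # Hypothetical file name from the Kaggle download; change it to the file you grabbed.
    train = pd.read_csv('emnist-letters-train.csv', header=None)

    y_train = train.iloc[:, 0].values                      # first column: class label
    X_train = train.iloc[:, 1:].values.astype(np.float32)  # remaining 784 columns: pixel values

    print(X_train.shape, y_train.shape)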

answered Oct 24 '22 by Josh Payne


@Josh Payne's answer is correct, but I'll expand on it for those who want to use the .mat file with an emphasis on typical data splits.

The data itself has already been split up into a training and a test set. Here's how I accessed the data:

    from scipy import io as sio
    mat = sio.loadmat('emnist-letters.mat')
    data = mat['dataset']

    X_train = data['train'][0,0]['images'][0,0]
    y_train = data['train'][0,0]['labels'][0,0]
    X_test = data['test'][0,0]['images'][0,0]
    y_test = data['test'][0,0]['labels'][0,0]

There is an additional field 'writers' (e.g. data['train'][0,0]['writers'][0,0]) that identifies the original writer of each sample. Finally, there is another field, data['mapping'], but I'm not sure what it is mapping the digits to.
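If it helps, my best guess (worth verifying against your own file) is that each row of 'mapping' pairs a class label with the ASCII code of the character it represents, so labels could be decoded roughly like this:

    # Assumption: each row of 'mapping' is [class_label, ascii_code].
    mapping = data['mapping'][0, 0]
    label_to_char = {int(row[0]): chr(int(row[1])) for row in mapping}

    print(label_to_char[int(y_train[0][0])])  # character for the first training label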

In addition, in Section II D, the EMNIST paper states that "the last portion of the training set, equal in size to the testing set, is set aside as a validation set". Strangely, the .mat file training/testing size does not match the number listed in Table II, but it does match the size in Fig. 2.

    val_start = X_train.shape[0] - X_test.shape[0]
    X_val = X_train[val_start:X_train.shape[0],:]
    y_val = y_train[val_start:X_train.shape[0]]
    X_train = X_train[0:val_start,:]
    y_train = y_train[0:val_start]

If you don't want a validation set it is fine to leave these samples in the training set.

Also, if you would like to reshape the data into 2D, 28x28 images instead of a 1D array of 784 values, you'll need to do a numpy reshape using Fortran ordering to get the correct image orientation (MATLAB uses column-major ordering, just like Fortran; reference), e.g.:

    X_train = X_train.reshape( (X_train.shape[0], 28, 28), order='F')
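As a quick sanity check on the orientation (assuming matplotlib is installed), plotting one sample should show an upright letter:

    import matplotlib.pyplot as plt

    # After the Fortran-order reshape, each sample is a 28x28 array.
    plt.imshow(X_train[0], cmap='gray')
    plt.title('label: {}'.format(int(y_train[0][0])))
    plt.show()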
answered Oct 24 '22 by tlindbloom


An alternative solution is to use the EMNIST python package. (Full details at https://pypi.org/project/emnist/)

This lets you pip install emnist in your environment and then import the datasets (they will be downloaded the first time you run the program).

Example from the site:

  >>> from emnist import extract_training_samples
  >>> images, labels = extract_training_samples('digits')
  >>> images.shape
  (240000, 28, 28)
  >>> labels.shape
  (240000,)

You can also list the available datasets:

  >>> from emnist import list_datasets
  >>> list_datasets()
  ['balanced', 'byclass', 'bymerge', 'digits', 'letters', 'mnist']

And replace 'digits' in the first example with your choice.
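For example, for the letters split (and, if I recall the package's API correctly, there is a matching extract_test_samples helper for the test data):

  >>> from emnist import extract_training_samples, extract_test_samples
  >>> images, labels = extract_training_samples('letters')
  >>> test_images, test_labels = extract_test_samples('letters')
  >>> images.shape
  (124800, 28, 28)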

This gives you all the data in NumPy arrays, which I have found makes things easy to work with.

answered Oct 24 '22 by Daniel B


I suggest downloading the 'Binary format as the original MNIST dataset' option from the EMNIST download page (it uses the same IDX file format as the original MNIST dataset).

Unzip the downloaded file and then, with Python (using the idx2numpy package):

import idx2numpy

X_train = idx2numpy.convert_from_file('./emnist-letters-train-images-idx3-ubyte')
y_train = idx2numpy.convert_from_file('./emnist-letters-train-labels-idx1-ubyte')

X_test = idx2numpy.convert_from_file('./emnist-letters-test-images-idx3-ubyte')
y_test = idx2numpy.convert_from_file('./emnist-letters-test-labels-idx1-ubyte')
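A quick check after loading, plus an orientation fix in case the images come out rotated the same way as the .mat data (a sketch; verify by plotting a sample first):

import numpy as np

print(X_train.shape, y_train.shape)  # expect (124800, 28, 28) and (124800,)
print(X_test.shape, y_test.shape)

# If the letters look rotated/mirrored, transposing each image
# restores the usual orientation.
X_train = np.transpose(X_train, (0, 2, 1))
X_test = np.transpose(X_test, (0, 2, 1))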
answered Oct 24 '22 by Marco Cerliani