I have downloaded the MNIST dataset from LeCun's site. What I want is to write Python code that extracts the gzip archives and reads the dataset directly from the directory, so that I don't have to download from or access the MNIST site anymore.
Desired process: access folder/directory --> extract gzip --> read dataset (one-hot encoding)
How can I do this? Almost all the tutorials access either the LeCun or the TensorFlow site to download and read the dataset. Thanks in advance!
The easiest way to load the data is through Keras. The MNIST dataset consists of training data and testing data. Each image is stored as a 28x28 array, and the corresponding output is the digit shown in the image. We can verify this by looking at the shapes of the training and testing data.
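For reference, a minimal Keras sketch (note: on the first call load_data() downloads and caches mnist.npz under ~/.keras/datasets/, after which it works offline):
from tensorflow.keras.datasets import mnist

# Loads the cached mnist.npz if present; downloads it once otherwise.
(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)
print(x_test.shape, y_test.shape)    # (10000, 28, 28) (10000,)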
The primary repository for the MNIST files is currently located at yann.lecun.com/exdb/mnist. The training pixel data is stored in the file train-images-idx3-ubyte.gz and the training label data in train-labels-idx1-ubyte.gz.
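If you would rather read those .gz files directly without extracting them first, here is a minimal sketch using only gzip and NumPy (the folder path is an assumption; the 16- and 8-byte offsets skip the IDX headers of the image and label files respectively):
import gzip
import numpy as np

def read_idx_gz(images_gz, labels_gz):
    # IDX image files start with a 16-byte header (magic, count, rows, cols);
    # label files start with an 8-byte header (magic, count).
    with gzip.open(images_gz, 'rb') as f:
        images = np.frombuffer(f.read(), dtype=np.uint8, offset=16).reshape(-1, 28, 28)
    with gzip.open(labels_gz, 'rb') as f:
        labels = np.frombuffer(f.read(), dtype=np.uint8, offset=8)
    return images, labels

train_images, train_labels = read_idx_gz('mnist/train-images-idx3-ubyte.gz',
                                         'mnist/train-labels-idx1-ubyte.gz')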
If you already have the MNIST files extracted, you can load them directly with low-level NumPy:
import numpy as np

def loadMNIST( prefix, folder ):
    # The IDX image header holds four big-endian int32 values:
    # magic number, image count, width, height.
    intType = np.dtype( 'int32' ).newbyteorder( '>' )
    nMetaDataBytes = 4 * intType.itemsize
    data = np.fromfile( folder + "/" + prefix + '-images-idx3-ubyte', dtype = 'ubyte' )
    magicBytes, nImages, width, height = np.frombuffer( data[:nMetaDataBytes].tobytes(), intType )
    data = data[nMetaDataBytes:].astype( dtype = 'float32' ).reshape( [ nImages, width, height ] )

    # The label file has only two int32 header fields: magic number and item count.
    labels = np.fromfile( folder + "/" + prefix + '-labels-idx1-ubyte',
                          dtype = 'ubyte' )[2 * intType.itemsize:]
    return data, labels

trainingImages, trainingLabels = loadMNIST( "train", "../datasets/mnist/" )
testImages, testLabels = loadMNIST( "t10k", "../datasets/mnist/" )
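A quick sanity check of what comes back (the counts below assume the standard MNIST files; pixels are returned as float32 in the 0-255 range, so you may want to rescale them):
print( trainingImages.shape, trainingLabels.shape )   # (60000, 28, 28) (60000,)
print( testImages.shape, testLabels.shape )           # (10000, 28, 28) (10000,)

# Optional: scale pixel values to [0, 1] before training.
trainingImages /= 255.0
testImages /= 255.0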
And to convert the labels to a one-hot encoding:
def toHotEncoding( classification ):
    # emulates the functionality of tf.keras.utils.to_categorical( y )
    hotEncoding = np.zeros( [ len( classification ),
                              np.max( classification ) + 1 ] )
    hotEncoding[ np.arange( len( hotEncoding ) ), classification ] = 1
    return hotEncoding

trainingLabels = toHotEncoding( trainingLabels )
testLabels = toHotEncoding( testLabels )
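A small self-contained check of the helper (the toy labels below are just for illustration); np.argmax recovers the original digits:
sample = np.array( [3, 0, 9], dtype = 'ubyte' )
encoded = toHotEncoding( sample )
print( encoded.shape )                    # (3, 10)
print( np.argmax( encoded, axis = 1 ) )   # [3 0 9]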
This TensorFlow call
from tensorflow.examples.tutorials.mnist import input_data
input_data.read_data_sets('my/directory')
... won't download anything if you already have the files there.
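Since the goal includes one-hot labels, it may help to know that read_data_sets also takes a one_hot flag. A minimal sketch, assuming the four .gz files already sit in 'my/directory' (note this module is deprecated and was removed in TensorFlow 2.x):
from tensorflow.examples.tutorials.mnist import input_data

# Reads the local *.gz files; nothing is downloaded if they are already present.
mnist = input_data.read_data_sets('my/directory', one_hot=True)

print(mnist.train.images.shape)       # (55000, 784) -- 5000 images are split off for validation
print(mnist.train.labels.shape)       # (55000, 10)  -- one-hot labels
print(mnist.validation.images.shape)  # (5000, 784)
print(mnist.test.images.shape)        # (10000, 784)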
But if for some reason you wish to unzip them yourself, here's how you do it:
from tensorflow.contrib.learn.python.learn.datasets.mnist import extract_images, extract_labels

with open('my/directory/train-images-idx3-ubyte.gz', 'rb') as f:
    train_images = extract_images(f)
with open('my/directory/train-labels-idx1-ubyte.gz', 'rb') as f:
    train_labels = extract_labels(f)
with open('my/directory/t10k-images-idx3-ubyte.gz', 'rb') as f:
    test_images = extract_images(f)
with open('my/directory/t10k-labels-idx1-ubyte.gz', 'rb') as f:
    test_labels = extract_labels(f)
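For reference (based on the TensorFlow 1.x tf.contrib implementation, so treat the exact shapes as assumptions): extract_images returns a uint8 array of shape (num_images, 28, 28, 1), and extract_labels also accepts a one_hot flag if you want the encoded labels directly:
with open('my/directory/train-labels-idx1-ubyte.gz', 'rb') as f:
    train_labels_one_hot = extract_labels(f, one_hot=True)

print(train_images.shape)           # (60000, 28, 28, 1), dtype uint8
print(train_labels_one_hot.shape)   # (60000, 10)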
I will show how to load it from scratch (for better understanding) using the pickled mnist.pkl.gz distribution, and how to display a digit image from it with matplotlib.pyplot:
import pickle
import gzip
import numpy as np
import matplotlib.pyplot as plt

def load_data():
    path = '../../data/mnist.pkl.gz'
    f = gzip.open(path, 'rb')
    # mnist.pkl.gz was pickled under Python 2, so Python 3 needs latin1 encoding
    training_data, validation_data, test_data = pickle.load(f, encoding='latin1')
    f.close()
    X_train, y_train = training_data[0], training_data[1]
    print(X_train.shape, y_train.shape)
    # (50000, 784) (50000,)
    # get the first image and its label
    img1_arr, img1_label = X_train[0], y_train[0]
    print(img1_arr.shape, img1_label)
    # (784,) 5
    # reshape the first image (1-D vector) into a 2-D 28x28 image
    img1_2d = np.reshape(img1_arr, (28, 28))
    # show it
    plt.subplot(111)
    plt.imshow(img1_2d, cmap=plt.get_cmap('gray'))
    plt.show()
You can also vectorize a label into a 10-dimensional unit vector with this sample function:
def vectorized_result(label):
    e = np.zeros((10, 1))
    e[label] = 1.0
    return e
Vectorize the above label:
print(vectorized_result(img1_label))
# output as below:
[[ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 1.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]]
If you want to turn it into CNN input, you can reshape it like this:
def load_data_v2():
    path = '../../data/mnist.pkl.gz'
    f = gzip.open(path, 'rb')
    training_data, validation_data, test_data = pickle.load(f, encoding='latin1')
    f.close()
    X_train, y_train = training_data[0], training_data[1]
    print(X_train.shape, y_train.shape)
    # (50000, 784) (50000,)
    # reshape each flat image to 28x28 and one-hot encode each label
    X_train = np.array([np.reshape(item, (28, 28)) for item in X_train])
    y_train = np.array([vectorized_result(item) for item in y_train])
    print(X_train.shape, y_train.shape)
    # (50000, 28, 28) (50000, 10, 1)
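One follow-up, as an assumption about the downstream framework rather than part of the answer above: typical channels-last CNN layers (e.g. in Keras) expect an explicit channel axis and 2-D one-hot labels, so one more reshape may be needed:
# Add a channel axis and flatten the labels from (50000, 10, 1) to (50000, 10).
X_train = X_train.reshape(-1, 28, 28, 1)
y_train = y_train.reshape(-1, 10)
print(X_train.shape, y_train.shape)
# (50000, 28, 28, 1) (50000, 10)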