How to use datasets.fetch_mldata() in sklearn?

I am trying to run the following code for a short machine learning script:

import re
import argparse
import csv
from collections import Counter
from sklearn import datasets
import sklearn
from sklearn.datasets import fetch_mldata

dataDict = datasets.fetch_mldata('MNIST Original')

In this piece of code, I am trying to read the dataset 'MNIST Original' hosted at mldata.org via sklearn. This results in the following error (there are more lines of code, but the error occurs at this particular line):

Traceback (most recent call last):
  File "C:\Program Files (x86)\JetBrains\PyCharm 2.7.3\helpers\pydev\pydevd.py", line 1481, in <module>
    debugger.run(setup['file'], None, None)
  File "C:\Program Files (x86)\JetBrains\PyCharm 2.7.3\helpers\pydev\pydevd.py", line 1124, in run
    pydev_imports.execfile(file, globals, locals) #execute the script
  File "C:/Users/sony/PycharmProjects/Machine_Learning_Homework1/zeroR.py", line 131, in <module>
    dataDict = datasets.fetch_mldata('MNIST Original')
  File "C:\Anaconda\lib\site-packages\sklearn\datasets\mldata.py", line 157, in fetch_mldata
    matlab_dict = io.loadmat(matlab_file, struct_as_record=True)
  File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio.py", line 176, in loadmat
    matfile_dict = MR.get_variables(variable_names)
  File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio5.py", line 294, in get_variables
    res = self.read_var_array(hdr, process)
  File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio5.py", line 257, in read_var_array
    return self._matrix_reader.array_from_header(header, process)
  File "mio5_utils.pyx", line 624, in scipy.io.matlab.mio5_utils.VarReader5.array_from_header (scipy\io\matlab\mio5_utils.c:5717)
  File "mio5_utils.pyx", line 653, in scipy.io.matlab.mio5_utils.VarReader5.array_from_header (scipy\io\matlab\mio5_utils.c:5147)
  File "mio5_utils.pyx", line 721, in scipy.io.matlab.mio5_utils.VarReader5.read_real_complex (scipy\io\matlab\mio5_utils.c:6134)
  File "mio5_utils.pyx", line 424, in scipy.io.matlab.mio5_utils.VarReader5.read_numeric (scipy\io\matlab\mio5_utils.c:3704)
  File "mio5_utils.pyx", line 360, in scipy.io.matlab.mio5_utils.VarReader5.read_element (scipy\io\matlab\mio5_utils.c:3429)
  File "streams.pyx", line 181, in scipy.io.matlab.streams.FileStream.read_string (scipy\io\matlab\streams.c:2711)
IOError: could not read bytes

I have tried researching this on the internet, but there is hardly any help available. Any expert help with solving this error would be much appreciated.

TIA.

asked Oct 22 '13 23:10

Patthebug

4 Answers

As of version 0.20, scikit-learn deprecated the fetch_mldata function and added fetch_openml as its replacement; fetch_mldata was removed entirely in version 0.22.

Download the MNIST dataset with the following code:

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')

There are some changes to the format though. For instance, mnist['target'] is an array of string category labels (not floats as before).
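
If integer labels are needed, the strings can be converted after loading. This is a minimal sketch, assuming scikit-learn 0.22 or newer (for the as_frame parameter) and a working network connection; the helper names labels_to_int and load_mnist are made up for this example:

```python
def labels_to_int(labels):
    """Convert string digit labels such as '5' into Python ints."""
    return [int(label) for label in labels]


def load_mnist():
    """Download MNIST from OpenML and return (X, y) with integer labels."""
    # Imported here so the helper above can be used without scikit-learn.
    import numpy as np
    from sklearn.datasets import fetch_openml

    # as_frame=False asks for NumPy arrays instead of a pandas DataFrame
    # (assumption: scikit-learn >= 0.22, where as_frame exists).
    mnist = fetch_openml('mnist_784', version=1, as_frame=False)
    X = mnist['data']
    # The targets arrive as strings ('0' ... '9'); store them as uint8.
    y = np.array(labels_to_int(mnist['target']), dtype=np.uint8)
    return X, y
```

Calling load_mnist() triggers the download on first use; later calls read from the local scikit-learn cache.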

answered Oct 25 '22 08:10

skovorodkin

It looks like the cached data is corrupted. Try removing it and downloading again (this takes a moment). Unless specified otherwise, the data for 'MNIST original' should be in

~/scikit_learn_data/mldata/mnist-original.mat
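
Removing the cached file can also be scripted. This is a minimal sketch using only the standard library; the function name remove_cached_mldata is hypothetical, and the default location assumes data_home was not passed explicitly (scikit-learn also reads the SCIKIT_LEARN_DATA environment variable, which the sketch respects):

```python
import os
from pathlib import Path


def remove_cached_mldata(filename="mnist-original.mat", data_home=None):
    """Delete a cached mldata file; return True if a file was removed."""
    if data_home is None:
        # Default cache folder, unless overridden by the environment variable.
        data_home = os.environ.get("SCIKIT_LEARN_DATA", "~/scikit_learn_data")
    cached = Path(data_home).expanduser() / "mldata" / filename
    if cached.is_file():
        cached.unlink()
        return True
    return False
```

After the file is gone, the next fetch call downloads a fresh copy instead of trying to read the truncated one.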

answered Oct 25 '22 08:10

Szymon Laszczyński

I downloaded the dataset from this link

https://github.com/amplab/datascience-sp14/blob/master/lab7/mldata/mnist-original.mat

Then I ran these lines:

from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original', transpose_data=True, data_home='files')

Note: the downloaded file must be placed at (your working directory)/files/mldata/mnist-original.mat

I hope this helps; it worked well for me.
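
On scikit-learn versions where fetch_mldata no longer exists, the downloaded file can be read directly with SciPy instead. This is a rough sketch, not the library's own loader: the function names are invented for illustration, and it assumes the .mat file stores the images under the key 'data' (one column per image) and the labels under 'label', which should be checked against the actual file:

```python
from pathlib import Path


def local_mat_path(data_home="files"):
    """Location where the manually downloaded file is expected."""
    return Path(data_home) / "mldata" / "mnist-original.mat"


def load_local_mnist(data_home="files"):
    """Read the downloaded .mat file and return (X, y)."""
    mat_path = local_mat_path(data_home)
    if not mat_path.is_file():
        raise FileNotFoundError(f"expected the dataset at {mat_path}")
    # Imported lazily so the path check works even without SciPy installed.
    from scipy.io import loadmat

    raw = loadmat(str(mat_path))
    # ASSUMPTION: keys 'data' and 'label'; inspect raw.keys() if this fails.
    X = raw["data"].T    # transpose so that each row is one image
    y = raw["label"][0]  # flatten the 1 x N label matrix
    return X, y
```

If loadmat itself fails with "could not read bytes", the file on disk is most likely incomplete and should be downloaded again.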

answered Oct 25 '22 08:10

Soundous Bahri

Here is some sample code showing how to get MNIST data ready for use with sklearn:

def get_data():
    """
    Get MNIST data ready to learn with.

    Returns
    -------
    dict
        With keys 'train' and 'test'. Each maps to a dict with the keys
        'X' (features) and 'y' (labels).
    """
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')

    x = mnist.data
    y = mnist.target

    # Scale data from [0, 255] to [-1, 1] - this is of major importance!
    x = x/255.0*2 - 1

    # train_test_split lives in sklearn.model_selection since 0.18;
    # the old sklearn.cross_validation module was removed in 0.20.
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                        test_size=0.33,
                                                        random_state=42)
    data = {'train': {'X': x_train,
                      'y': y_train},
            'test': {'X': x_test,
                     'y': y_test}}
    return data
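
The scaling step maps the raw pixel range [0, 255] linearly onto [-1, 1]. Here is a quick check of that arithmetic; the helper name scale_pixels is hypothetical, and the same expression works element-wise when x is a NumPy array:

```python
def scale_pixels(x):
    """Map pixel intensities from [0, 255] to [-1, 1]."""
    return x / 255.0 * 2 - 1


# Black, mid-grey and white pixels land on -1, 0 and 1 respectively.
for raw in (0, 127.5, 255):
    print(raw, '->', scale_pixels(raw))
```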

answered Oct 25 '22 08:10

Martin Thoma

Tags:

python

machine-learning

numpy
