How to use datasets.fetch_mldata() in sklearn?

I am trying to run the following code for a brief machine learning algorithm:

import re
import argparse
import csv
from collections import Counter
from sklearn import datasets
import sklearn
from sklearn.datasets import fetch_mldata

dataDict = datasets.fetch_mldata('MNIST Original')

In this piece of code, I am trying to read the dataset 'MNIST Original' hosted at mldata.org via sklearn. This results in the following error (there are more lines of code, but the error occurs at this particular line):

Traceback (most recent call last):
  File "C:\Program Files (x86)\JetBrains\PyCharm 2.7.3\helpers\pydev\pydevd.py", line 1481, in <module>
    debugger.run(setup['file'], None, None)
  File "C:\Program Files (x86)\JetBrains\PyCharm 2.7.3\helpers\pydev\pydevd.py", line 1124, in run
    pydev_imports.execfile(file, globals, locals) #execute the script
  File "C:/Users/sony/PycharmProjects/Machine_Learning_Homework1/zeroR.py", line 131, in <module>
    dataDict = datasets.fetch_mldata('MNIST Original')
  File "C:\Anaconda\lib\site-packages\sklearn\datasets\mldata.py", line 157, in fetch_mldata
    matlab_dict = io.loadmat(matlab_file, struct_as_record=True)
  File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio.py", line 176, in loadmat
    matfile_dict = MR.get_variables(variable_names)
  File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio5.py", line 294, in get_variables
    res = self.read_var_array(hdr, process)
  File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio5.py", line 257, in read_var_array
    return self._matrix_reader.array_from_header(header, process)
  File "mio5_utils.pyx", line 624, in scipy.io.matlab.mio5_utils.VarReader5.array_from_header (scipy\io\matlab\mio5_utils.c:5717)
  File "mio5_utils.pyx", line 653, in scipy.io.matlab.mio5_utils.VarReader5.array_from_header (scipy\io\matlab\mio5_utils.c:5147)
  File "mio5_utils.pyx", line 721, in scipy.io.matlab.mio5_utils.VarReader5.read_real_complex (scipy\io\matlab\mio5_utils.c:6134)
  File "mio5_utils.pyx", line 424, in scipy.io.matlab.mio5_utils.VarReader5.read_numeric (scipy\io\matlab\mio5_utils.c:3704)
  File "mio5_utils.pyx", line 360, in scipy.io.matlab.mio5_utils.VarReader5.read_element (scipy\io\matlab\mio5_utils.c:3429)
  File "streams.pyx", line 181, in scipy.io.matlab.streams.FileStream.read_string (scipy\io\matlab\streams.c:2711)
IOError: could not read bytes

I have tried researching this on the internet, but there is hardly any help available. Any expert help with solving this error would be much appreciated.

TIA.

Patthebug asked Oct 22 '13 23:10


4 Answers

As of version 0.20, sklearn deprecates the fetch_mldata function and adds fetch_openml in its place.

Download MNIST dataset with the following code:

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')

There are some changes to the format though. For instance, mnist['target'] is an array of string category labels (not floats as before).
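
As a minimal sketch of coping with the new format, the string labels can be cast back to integers after loading. The helper names labels_to_int and load_mnist are made up for illustration, and as_frame=False assumes a scikit-learn release recent enough to accept that parameter:

```python
import numpy as np


def labels_to_int(target):
    """Convert string digit labels such as '5' to unsigned 8-bit integers."""
    return np.asarray(target).astype(np.uint8)


def load_mnist():
    """Download (or load from cache) MNIST and return (data, integer labels)."""
    # Imported here so that labels_to_int can be used without scikit-learn.
    from sklearn.datasets import fetch_openml

    # Assumption: as_frame=False asks for NumPy arrays instead of a DataFrame.
    mnist = fetch_openml('mnist_784', as_frame=False)
    return mnist['data'], labels_to_int(mnist['target'])
```

Calling load_mnist() needs network access the first time; later calls should reuse the local cache.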

skovorodkin answered Oct 25 '22 08:10


Looks like the cached data are corrupted. Try removing them and downloading again (it takes a moment). Unless specified otherwise, the data for 'MNIST original' should be in

~/scikit_learn_data/mldata/mnist-original.mat
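
Removing the cached file could be scripted roughly as follows. This is only a sketch: remove_cached_mldata is a hypothetical helper, and it assumes the default cache root ~/scikit_learn_data shown in the path above unless data_home is given.

```python
from pathlib import Path


def remove_cached_mldata(filename="mnist-original.mat", data_home=None):
    """Delete a cached mldata file so the next fetch downloads it again.

    Returns True if a file was deleted and False if nothing was cached.
    """
    # Assumption: the cache root defaults to ~/scikit_learn_data.
    if data_home is not None:
        root = Path(data_home)
    else:
        root = Path.home() / "scikit_learn_data"

    cached = root / "mldata" / filename
    if cached.is_file():
        cached.unlink()
        return True
    return False
```

If the cache lives elsewhere, sklearn.datasets.get_data_home() reports the directory scikit-learn actually uses, and its result can be passed as data_home.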

Szymon Laszczyński answered Oct 25 '22 08:10


I downloaded the dataset from this link:

https://github.com/amplab/datascience-sp14/blob/master/lab7/mldata/mnist-original.mat

Then I ran these lines:

from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original', transpose_data=True, data_home='files')

Note: the file must be located at (your working directory)/files/mldata/mnist-original.mat

I hope this helps; it worked well for me.
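
Putting the downloaded file in the expected place might look like the sketch below. The helper install_mldata_file is hypothetical, and the layout <data_home>/mldata/<file name> is taken from the path given above.

```python
import shutil
from pathlib import Path


def install_mldata_file(downloaded, data_home="files"):
    """Copy a manually downloaded .mat file to <data_home>/mldata/.

    Returns the path of the copied file.
    """
    target_dir = Path(data_home) / "mldata"
    # Create files/mldata (or the equivalent) if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / Path(downloaded).name
    shutil.copy2(downloaded, target)
    return target
```

After something like install_mldata_file("mnist-original.mat"), the fetch_mldata call with data_home='files' should find the file locally instead of downloading it.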

Soundous Bahri answered Oct 25 '22 08:10


Here is some sample code showing how to get the MNIST data ready for use with sklearn:

def get_data():
    """
    Get MNIST data ready to learn with.

    Returns
    -------
    dict
        With keys 'train' and 'test'. Both have the keys 'X' (features)
        and 'y' (labels).
    """
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')

    x = mnist.data
    y = mnist.target

    # Scale pixel values from [0, 255] to [-1, 1]; this is of major importance!
    x = x/255.0*2 - 1

    # sklearn.cross_validation was removed in 0.20; model_selection replaces it
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                        test_size=0.33,
                                                        random_state=42)
    data = {'train': {'X': x_train,
                      'y': y_train},
            'test': {'X': x_test,
                     'y': y_test}}
    return data
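
To make the scaling step concrete, here it is in isolation as a small sketch. The function name scale_pixels is made up for illustration, and it assumes the raw pixel values lie in [0, 255]:

```python
import numpy as np


def scale_pixels(x):
    """Map pixel values from [0, 255] to [-1.0, 1.0]."""
    # Dividing by 255 gives [0, 1]; doubling and subtracting 1 gives [-1, 1].
    return np.asarray(x, dtype=np.float64) / 255.0 * 2 - 1
```

A black pixel (0) maps to -1, a mid-gray value of 127.5 maps to 0, and a white pixel (255) maps to 1.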

Martin Thoma answered Oct 25 '22 08:10