How to download datasets for sklearn? - python

Tags: python, machine-learning, dataset, nlp, scikit-learn

In NLTK there is a nltk.download() function to download the datasets that come with the NLP suite.

The sklearn documentation talks about loading datasets (http://scikit-learn.org/stable/datasets/) and fetching data from http://mldata.org/, but for the rest of the datasets, the instructions are to download from the source.

Where should I save the data that I've downloaded from the source? Are there any other steps after I save the data into the correct directory before I can call it from my Python code?

Is there an example of how to download e.g. the 20newsgroups dataset?

I've pip-installed sklearn and tried this, but I got an IOError, most probably because I haven't downloaded the dataset from the source.


>>> from sklearn.datasets import fetch_20newsgroups
>>> fetch_20newsgroups(subset='train')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/sklearn/datasets/twenty_newsgroups.py", line 207, in fetch_20newsgroups
    cache_path=cache_path)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/datasets/twenty_newsgroups.py", line 89, in download_20newsgroups
    tarfile.open(archive_path, "r:gz").extractall(path=target_dir)
  File "/usr/lib/python2.7/tarfile.py", line 1678, in open
    return func(name, filemode, fileobj, **kwargs)
  File "/usr/lib/python2.7/tarfile.py", line 1727, in gzopen
    **kwargs)
  File "/usr/lib/python2.7/tarfile.py", line 1705, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "/usr/lib/python2.7/tarfile.py", line 1574, in __init__
    self.firstmember = self.next()
  File "/usr/lib/python2.7/tarfile.py", line 2334, in next
    raise ReadError("empty file")
tarfile.ReadError: empty file

386

asked Jan 07 '14 09:01

alvas

1 Answer

A network connection problem has probably corrupted the source archive on your drive. Delete the 20 newsgroups-related files and folders from the scikit_learn_data folder in your user's home directory and try again.


$ cd ~/scikit_learn_data
$ rm -rf 20news_home
$ rm 20news-bydate.pkz
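
If you would rather do the same cleanup from Python, something like the following might work. It is only a rough sketch: the function name clear_20news_cache and its data_home argument are made up for illustration, the default folder assumes the cache is in ~/scikit_learn_data, and the two file names are simply the ones from the shell commands above (they may differ in other scikit-learn versions).

```python
import shutil
from pathlib import Path


def clear_20news_cache(data_home=None):
    """Remove cached 20 newsgroups files so the next fetch re-downloads them.

    Hypothetical helper, not part of scikit-learn.
    Returns the list of paths that were actually removed.
    """
    # Assumption: the cache lives in ~/scikit_learn_data unless told otherwise.
    root = Path(data_home) if data_home else Path.home() / "scikit_learn_data"
    removed = []

    # Same target as `rm -rf 20news_home`: the folder holding the raw archive.
    archive_dir = root / "20news_home"
    if archive_dir.is_dir():
        shutil.rmtree(archive_dir)
        removed.append(archive_dir)

    # Same target as `rm 20news-bydate.pkz`: the cached, processed dataset.
    cache_file = root / "20news-bydate.pkz"
    if cache_file.is_file():
        cache_file.unlink()
        removed.append(cache_file)

    return removed
```

After calling clear_20news_cache(), running fetch_20newsgroups(subset='train') again should start a fresh download, provided the network connection holds this time.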

188

answered Sep 30 '22 13:09

ogrisel
