In NLTK there is an nltk.download() function to download the datasets that come with the NLP suite.
The sklearn documentation talks about loading datasets (http://scikit-learn.org/stable/datasets/) and fetching data from http://mldata.org/, but for the rest of the datasets, the instructions are to download from the source.
Where should I save the data that I've downloaded from the source? Are there any other steps after I save the data into the correct directory before I can load it from my Python code?
Is there an example of how to download e.g. the 20newsgroups dataset?
I've pip-installed sklearn and tried this, but I got an IOError, most probably because I haven't downloaded the dataset from the source.
>>> from sklearn.datasets import fetch_20newsgroups
>>> fetch_20newsgroups(subset='train')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/sklearn/datasets/twenty_newsgroups.py", line 207, in fetch_20newsgroups
    cache_path=cache_path)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/datasets/twenty_newsgroups.py", line 89, in download_20newsgroups
    tarfile.open(archive_path, "r:gz").extractall(path=target_dir)
  File "/usr/lib/python2.7/tarfile.py", line 1678, in open
    return func(name, filemode, fileobj, **kwargs)
  File "/usr/lib/python2.7/tarfile.py", line 1727, in gzopen
    **kwargs)
  File "/usr/lib/python2.7/tarfile.py", line 1705, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "/usr/lib/python2.7/tarfile.py", line 1574, in __init__
    self.firstmember = self.next()
  File "/usr/lib/python2.7/tarfile.py", line 2334, in next
    raise ReadError("empty file")
tarfile.ReadError: empty file
A network connection problem has probably corrupted the source archive on your drive. Delete the 20 newsgroups-related files or folders from the scikit_learn_data folder in your home directory and try again:
$ cd ~/scikit_learn_data
$ rm -rf 20news_home
$ rm 20news-bydate.pkz
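The same cleanup can be done from Python. This is only a minimal sketch: clear_20newsgroups_cache is a hypothetical helper name (not part of scikit-learn), and it assumes the cache lives in ~/scikit_learn_data with the file and folder names used in the commands above.

```python
import os
import shutil


def clear_20newsgroups_cache(data_home=None):
    """Remove cached 20 newsgroups files so the next fetch re-downloads them.

    Hypothetical helper, not part of scikit-learn.
    Returns the list of paths that were actually removed.
    """
    if data_home is None:
        # Assumed default cache location, taken from the commands above.
        data_home = os.path.join(os.path.expanduser("~"), "scikit_learn_data")

    removed = []
    # Names taken from the rm commands above.
    for name in ("20news_home", "20news-bydate.pkz"):
        path = os.path.join(data_home, name)
        if os.path.isdir(path):
            shutil.rmtree(path)      # extracted archive folder
            removed.append(path)
        elif os.path.isfile(path):
            os.remove(path)          # cached pickle file
            removed.append(path)
    return removed
```

After calling clear_20newsgroups_cache(), running fetch_20newsgroups(subset='train') again should trigger a fresh download, provided the network connection is working this time.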