I am trying to run the classification demo based on the 20 newsgroups dataset. I downloaded the .py file from here (http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#sphx-glr-auto-examples-text-document-classification-20newsgroups-py) and ran it as usual, but got an error saying there is a network connection timeout. I am a little confused, since I can download the data file directly from the provided URL (https://ndownloader.figshare.com/files/5975967). Does anyone know how to resolve this issue? Is there any way I can use the manually downloaded data file?
Environment: Python 3.6, Anaconda 5.0.1
Quoting from the scikit-learn docs:
The sklearn.datasets.fetch_20newsgroups function is a data fetching / caching functions that downloads the data archive from the original 20 newsgroups website, extracts the archive contents in the ~/scikit_learn_data/20news_home folder and calls the sklearn.datasets.load_files on either the training or testing set folder, or both of them.
You can use the manually downloaded file by extracting it into the folder named above (~/scikit_learn_data/20news_home).
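As a rough sketch of the extraction step, the snippet below unpacks a manually downloaded archive into the 20news_home folder using only the standard library. The helper name extract_20news_archive and the example archive path are hypothetical. It also assumes the downloaded file is a gzip-compressed tar archive; adjust the mode if yours is not.

```python
import os
import tarfile


def extract_20news_archive(archive_path, data_home=None):
    """Extract a manually downloaded archive into <data_home>/20news_home.

    Hypothetical helper, not part of scikit-learn. Assumes the archive is
    a gzip-compressed tar file.
    """
    if data_home is None:
        # Default location quoted from the scikit-learn docs.
        data_home = os.path.join(os.path.expanduser("~"), "scikit_learn_data")
    target_dir = os.path.join(data_home, "20news_home")
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(path=target_dir)
    return target_dir


# Example call (the archive path is a placeholder):
# extract_20news_archive("/path/to/downloaded-archive.tar.gz")
```

Depending on the scikit-learn version, fetch_20newsgroups may still try to download if it does not find its own cache file, so it is worth checking that the call no longer reaches for the network after extraction.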
Alternatively, you can specify the data folder when calling the fetch_20newsgroups function by passing data_home='/path/to/data'. Change the function calls to look like this:
from sklearn.datasets import fetch_20newsgroups

# categories and remove are defined earlier in the demo script
data_train = fetch_20newsgroups(data_home='/path/to/data',
                                subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)
data_test = fetch_20newsgroups(data_home='/path/to/data',
                               subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)
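If the fetcher still insists on downloading, one possible fallback is to skip it and call sklearn.datasets.load_files directly on the extracted folders, which is what the quoted docs say the fetcher does internally. The sketch below is only an illustration: the helper name load_extracted_subset is hypothetical, and the subfolder names 20news-bydate-train and 20news-bydate-test and the latin1 encoding are assumptions about the archive layout that you should check against your extracted files.

```python
import os

from sklearn.datasets import load_files


def load_extracted_subset(data_home, subset="train", categories=None):
    """Load one subset from an already extracted archive.

    Hypothetical helper. Assumes the archive was extracted to
    <data_home>/20news_home and contains one folder per subset,
    named 20news-bydate-train and 20news-bydate-test.
    """
    folder = os.path.join(data_home, "20news_home", "20news-bydate-" + subset)
    # Each subfolder of `folder` is treated as one category.
    return load_files(folder, categories=categories, encoding="latin1",
                      shuffle=True, random_state=42)


# Example call (the path is a placeholder):
# data_train = load_extracted_subset("/path/to/data", "train")
```

Unlike fetch_20newsgroups, this does not apply the remove option, so headers, footers and quotes stay in the text unless you strip them yourself.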