I'm trying to learn NLTK (the Natural Language Toolkit, written in Python) and I want to install a sample data set to run some examples.
My web connection uses a proxy server, and I'm trying to specify the proxy address as follows:
>>> nltk.set_proxy('http://proxy.example.com:3128' ('USERNAME', 'PASSWORD'))
>>> nltk.download()
But I get an error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' object is not callable
I decided to set up a ProxyBasicAuthHandler before calling nltk.download():
import urllib2
auth_handler = urllib2.ProxyBasicAuthHandler(urllib2.HTTPPasswordMgrWithDefaultRealm())
auth_handler.add_password(realm=None, uri='http://proxy.example.com:3128/', user='USERNAME', passwd='PASSWORD')
opener = urllib2.build_opener(auth_handler)
urllib2.install_opener(opener)
import nltk
nltk.download()
But now I get HTTP Error 407 - Proxy Authentication Required.
The documentation says that if the proxy is set to None, this function will attempt to detect the system proxy. But it isn't working.
How can I install a sample data set for NLTK?
NLTK Download Server. Before downloading any packages, the corpus and module downloader contacts the NLTK download server, to retrieve an index file describing the available packages.
Command line installation: if necessary, run the download command from an administrator account, or using sudo. The recommended system location is C:\nltk_data (Windows); /usr/local/share/nltk_data (Mac); and /usr/share/nltk_data (Unix).
The website where you got the code for your first attempt contains an error (I have seen the same one).
The faulty line is:
nltk.set_proxy('http://proxy.example.com:3128' ('USERNAME', 'PASSWORD'))
You need a comma to separate the arguments. Without it, Python reads the parenthesized tuple as an attempt to call the string, which is why you get "'str' object is not callable". The correct line should be:
nltk.set_proxy('http://proxy.example.com:3128', ('USERNAME', 'PASSWORD'))
This will work just fine.
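To see why the missing comma produces that particular TypeError, here is a minimal sketch that reproduces it without NLTK. The names proxy_url, credentials and reproduce_error are made up for illustration; the point is only that a string followed directly by parentheses is treated as a function call.

```python
# Illustration only: these names are hypothetical and not part of NLTK.
proxy_url = 'http://proxy.example.com:3128'
credentials = ('USERNAME', 'PASSWORD')


def reproduce_error():
    """Call the string as if it were a function and return the error text."""
    try:
        # Equivalent to writing 'http://...' ('USERNAME', 'PASSWORD'):
        # with no comma, the tuple becomes the argument list of a call.
        proxy_url(credentials)
    except TypeError as exc:
        return str(exc)
    return None


print(reproduce_error())
```

With the comma in place, the URL and the credentials are passed as two separate arguments and no call on the string is attempted.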
I run NLTK 3.2.5 and Python 3.6 under Windows 10, and this script works for me:
nltk.set_proxy('http://user:password@proxy.example.com:3128')
nltk.download()
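One thing to watch with credentials embedded in the proxy URL: if the user name or password contains characters such as @ or :, they need to be percent-encoded first, otherwise the URL is parsed incorrectly. A small sketch, assuming a hypothetical helper called build_proxy_url (it is not part of NLTK):

```python
from urllib.parse import quote


def build_proxy_url(host, port, user, password, scheme="http"):
    """Return a proxy URL with percent-encoded credentials.

    Hypothetical helper for illustration; not an NLTK function.
    """
    # safe="" makes quote() encode every reserved character,
    # including '@', ':' and '/'.
    encoded_user = quote(user, safe="")
    encoded_password = quote(password, safe="")
    return "{}://{}:{}@{}:{}".format(
        scheme, encoded_user, encoded_password, host, port
    )


print(build_proxy_url("proxy.example.com", 3128, "user", "p@ss:word"))
```

The resulting string can then be passed as the single argument to nltk.set_proxy() in place of the hand-written URL.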
I was getting the same error too, but I found a solution that works reliably. You need to download nltk_data MANUALLY and put it in the /usr/lib/nltk_data directory on Linux, or in c:\nltk_data on Windows.
Here are the steps you need to follow:
1. Download the nltk_data zip file from this GitHub link:
https://github.com/nltk/nltk_data/tree/gh-pages
2. The data comes as a zip archive, so extract it.
3. For Ubuntu users especially: the following command opens a file manager with root privileges, which makes the copy/paste steps easier.
sudo nautilus
You can then copy into /usr/share or create a folder there.
4. On Linux, create a folder named nltk_data in /usr/share; on Windows, create the same folder in c:\.
5. Paste the entire contents of nltk_data-gh-pages (which you just extracted) into the nltk_data folder you just created.
6. From the nltk_data/packages folder, copy all the folders and paste them into the nltk_data folder.
Now you are done.
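If you prefer to script the extract-and-copy steps instead of using a file manager, something along these lines should do it. This is only a sketch: install_extracted_packages is a made-up name, it assumes the archive contains a single top-level folder with a packages directory inside (as nltk_data-gh-pages does), and it needs Python 3.8+ for dirs_exist_ok.

```python
import shutil
import zipfile
from pathlib import Path


def install_extracted_packages(zip_path, target_dir):
    """Extract the archive and copy each folder under packages/ into target_dir.

    Hypothetical helper for illustration. Returns the copied folder names.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Extract into a temporary staging folder inside the target.
    staging = target_dir / "_extracted"
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(staging)

    # Assumption: the layout is <top-level folder>/packages/<category>/...
    packages_dirs = sorted(staging.glob("*/packages"))
    if not packages_dirs:
        raise FileNotFoundError("no packages folder found in the archive")

    installed = []
    for category in sorted(packages_dirs[0].iterdir()):
        if category.is_dir():
            shutil.copytree(category, target_dir / category.name,
                            dirs_exist_ok=True)
            installed.append(category.name)

    shutil.rmtree(staging)  # remove the staging copy
    return installed


# Example call (paths are placeholders; system folders may need root/admin):
# install_extracted_packages("nltk_data-gh-pages.zip", "/usr/share/nltk_data")
```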
Since this is my first answer, I may not have explained the process clearly. If you have trouble with any of these steps, please leave a comment.
The options suggested above did not work for me. Here's what worked in my Windows environment: remove the parentheses around the user name and password.
nltk.set_proxy('http://proxy.example.com:3128', 'USERNAME', 'PASSWORD')
I run NLTK 3.0 and Python 3.4 on Windows, and proxy authentication works if I remove the brackets, so use this script:
nltk.set_proxy('http://proxy.example.com:3128', 'username', 'password')
To install an NLTK corpus manually:
1) Go to http://www.nltk.org/nltk_data/ and download the NLTK corpus file you want.
2) In a Python shell, check the value of nltk.data.path.
3) Choose one of the paths that exists on your machine, and unzip the data files into the corpora subdirectory inside it.
4) Now you can import the data: from nltk.corpus import stopwords
Reference: https://medium.com/@satorulogic/how-to-manually-download-a-nltk-corpus-f01569861da9
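Steps 2 and 3 can also be done from Python. The sketch below uses only the standard library; first_existing_dir and unpack_corpus are hypothetical names, and the list of search paths is passed in so that you can hand it nltk.data.path (or any list of candidate folders).

```python
import os
import zipfile


def first_existing_dir(search_paths):
    """Return the first path in the list that exists as a directory."""
    for path in search_paths:
        if os.path.isdir(path):
            return path
    return None


def unpack_corpus(corpus_zip, search_paths):
    """Unzip a downloaded corpus into <existing path>/corpora.

    Hypothetical helper for illustration; returns the corpora folder used.
    """
    base = first_existing_dir(search_paths)
    if base is None:
        raise FileNotFoundError("none of the search paths exist")
    corpora_dir = os.path.join(base, "corpora")
    os.makedirs(corpora_dir, exist_ok=True)
    with zipfile.ZipFile(corpus_zip) as archive:
        archive.extractall(corpora_dir)
    return corpora_dir


# Example call (requires NLTK; the zip file name is a placeholder):
# import nltk
# unpack_corpus("stopwords.zip", nltk.data.path)
```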
You can also set the system proxy in bash by setting the appropriate environment variables.
Some of the proxy settings which I keep are:
http_proxy=http://127.0.0.1:3129/
ftp_proxy=http://127.0.0.1:3129/
all_proxy=socks://127.0.0.1:3129/
https_proxy=http://127.0.0.1:3129/
You can make the environment variable changes permanent by editing your ~/.bashrc file. Sample edit:
export http_proxy=http://127.0.0.1:3129/
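To check that Python actually picks these variables up, you can ask the standard library which proxies it sees in the environment. A small sketch: proxies_from_environment is a made-up wrapper, while getproxies_environment is the standard urllib.request function that reads the *_proxy variables.

```python
import os
from urllib.request import getproxies_environment


def proxies_from_environment(extra_env=None):
    """Return the proxy mapping Python derives from *_proxy variables.

    extra_env is an optional dict of variables to set first
    (hypothetical convenience for this example).
    """
    if extra_env:
        os.environ.update(extra_env)
    return getproxies_environment()


# With http_proxy exported as in the sample edit, the result contains an
# "http" entry holding that URL.
print(proxies_from_environment())
```

If the printed mapping is empty, the variables were not exported in the shell that started Python.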