NLTK: set proxy server

I'm trying to learn NLTK (the Natural Language Toolkit, written in Python) and I want to install a sample data set to run some examples.

My web connection uses a proxy server, and I'm trying to specify the proxy address as follows:

>>> nltk.set_proxy('http://proxy.example.com:3128' ('USERNAME', 'PASSWORD'))
>>> nltk.download()

But I get an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object is not callable

I decided to set up a ProxyBasicAuthHandler before calling nltk.download():

import urllib2

auth_handler = urllib2.ProxyBasicAuthHandler(urllib2.HTTPPasswordMgrWithDefaultRealm())
auth_handler.add_password(realm=None, uri='http://proxy.example.com:3128/', user='USERNAME', passwd='PASSWORD')
opener = urllib2.build_opener(auth_handler)
urllib2.install_opener(opener)

import nltk
nltk.download()

But now I get HTTP Error 407 - Proxy Authentication Required.

The documentation says that if the proxy is set to None then this function will attempt to detect the system proxy. But it isn't working.

How can I install a sample data set for NLTK?

ymn asked Dec 17 '12 05:12



7 Answers

The website you copied your first attempt from contains an error (I have seen the same mistake there).

The line in error is

nltk.set_proxy('http://proxy.example.com:3128' ('USERNAME', 'PASSWORD'))

You need a comma to separate the arguments. The correct line should be

nltk.set_proxy('http://proxy.example.com:3128', ('USERNAME', 'PASSWORD'))

This will work just fine.
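
To see why the missing comma produces that particular TypeError, here is a minimal sketch. It does not touch NLTK at all; the variable names `proxy_url` and `credentials` and the helper `call_without_comma` are only illustrative. Writing a string immediately followed by a parenthesised tuple makes Python try to call the string, which is what this reproduces:

```python
proxy_url = 'http://proxy.example.com:3128'
credentials = ('USERNAME', 'PASSWORD')


def call_without_comma():
    # Equivalent to writing 'http://...' ('USERNAME', 'PASSWORD'):
    # Python treats the parentheses as a call on the string.
    return proxy_url(*credentials)


try:
    call_without_comma()
    error_message = None
except TypeError as exc:
    # Same message as in the question's traceback.
    error_message = str(exc)

print(error_message)
```

With the comma in place, the string and the tuple are passed as two separate arguments and no call on the string is attempted.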

demongolem answered Oct 05 '22 21:10


I run NLTK 3.2.5 and Python 3.6 on Windows 10. I use this script, with the credentials embedded in the proxy URL:

import nltk

nltk.set_proxy('http://user:password@proxy.example.com:3128')
nltk.download()
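
One caveat with putting credentials in the URL: if the username or password contains characters such as '@' or ':', the URL will not parse as intended unless they are percent-encoded. A minimal sketch, assuming a hypothetical helper `build_proxy_url` and a made-up password (neither is part of NLTK):

```python
from urllib.parse import quote


def build_proxy_url(host, port, user, password):
    """Return an http proxy URL with percent-encoded credentials."""
    # safe='' makes quote() also encode '/', so every reserved
    # character in the credentials is escaped.
    return 'http://{}:{}@{}:{}'.format(
        quote(user, safe=''), quote(password, safe=''), host, port
    )


# 'p@ss:word' is a made-up password chosen to show the escaping.
proxy_url = build_proxy_url('proxy.example.com', 3128, 'user', 'p@ss:word')
print(proxy_url)
```

The resulting string can then be passed to nltk.set_proxy() in place of the hand-written URL. I have not tested this against every proxy, so treat it as a starting point.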

jcpg answered Oct 05 '22 23:10


I was getting the same error too, but I found a solution that works perfectly. You need to download nltk_data MANUALLY and put it in /usr/share/nltk_data (or /usr/lib/nltk_data) on Linux, or in C:\nltk_data on Windows.
Here are the steps to follow:
1. Download the nltk_data zip file from this GitHub link:
https://github.com/nltk/nltk_data/tree/gh-pages
2. The data comes as a zip archive, so extract it.
3. For Ubuntu users in particular, the following command opens the file manager with root privileges, which makes the copy/paste steps easier:
sudo nautilus
You can now copy into /usr/share or create a folder there easily.
4. On Linux, create a folder named nltk_data in /usr/share; on Windows, create the same folder in C:\.
5. Paste all the contents of nltk_data-gh-pages (which you just extracted) into the nltk_data folder you just created.
6. From the nltk_data/packages folder, copy all the subfolders into the nltk_data folder itself. You are done.
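
If you would rather script step 6 than drag folders around, something like the following should do it. This is only a sketch: `promote_packages` is a name I made up, it assumes Python 3.8+ (for `dirs_exist_ok`), and it assumes the extracted data is already sitting in the nltk_data folder as described in step 5.

```python
import shutil
from pathlib import Path


def promote_packages(nltk_data_dir):
    """Copy every subfolder of <nltk_data_dir>/packages into <nltk_data_dir>."""
    root = Path(nltk_data_dir)
    copied = []
    for sub in sorted((root / 'packages').iterdir()):
        if sub.is_dir():
            # dirs_exist_ok lets the copy merge into a folder that already exists.
            shutil.copytree(sub, root / sub.name, dirs_exist_ok=True)
            copied.append(sub.name)
    return copied
```

On Linux you would call it as promote_packages('/usr/share/nltk_data'), from a Python session started with enough privileges to write there.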

Since this is my first answer, I may not have explained the process clearly. If you have trouble with any of these steps, please comment.

Ankit Maurya answered Oct 05 '22 23:10


The options suggested above did not work for me. Here is what worked in my Windows environment: remove the parentheses around the username and password and pass them as separate arguments.

nltk.set_proxy('http://proxy.example.com:3128', 'USERNAME', 'PASSWORD')

DACW answered Oct 05 '22 21:10


I run NLTK 3.0 and Python 3.4 on Windows, and proxy authentication works if I remove the brackets around the username and password. So use this script:

nltk.set_proxy('http://proxy.example.com:3128', 'username', 'password')

diah_stis answered Oct 05 '22 21:10


If you want to install an NLTK corpus manually:

1) Go to http://www.nltk.org/nltk_data/ and download the NLTK corpus file you want.

2) In a Python shell, check the value of nltk.data.path

3) Choose one of the paths that exists on your machine, and unzip the data files into the corpora subdirectory inside it.

4) Now you can import the data: from nltk.corpus import stopwords
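
Steps 2 and 3 above can be sketched in Python roughly as follows. The function name `install_corpus` is hypothetical and not part of NLTK; it takes the list of search paths as an argument so that you can pass in nltk.data.path.

```python
import zipfile
from pathlib import Path


def install_corpus(zip_path, search_paths):
    """Unzip a corpus into the 'corpora' folder of the first existing search path."""
    for candidate in search_paths:
        base = Path(candidate)
        if base.is_dir():
            target = base / 'corpora'
            target.mkdir(exist_ok=True)
            with zipfile.ZipFile(zip_path) as archive:
                archive.extractall(target)
            return target
    raise FileNotFoundError('none of the search paths exist on this machine')
```

For example, after downloading stopwords.zip you would run install_corpus('stopwords.zip', nltk.data.path). You need write permission on the chosen directory.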

Reference: https://medium.com/@satorulogic/how-to-manually-download-a-nltk-corpus-f01569861da9

SVK answered Oct 05 '22 22:10


You can also set the system proxy in bash by setting the appropriate environment variables.

Some of the proxy settings which I keep are:

http_proxy=http://127.0.0.1:3129/
ftp_proxy=http://127.0.0.1:3129/
all_proxy=socks://127.0.0.1:3129/
https_proxy=http://127.0.0.1:3129/

You can make the environment variable changes permanent by editing your ~/.bashrc file. Sample edit:

export http_proxy=http://127.0.0.1:3129/
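
To check from Python that these variables are actually visible, you can ask the standard library which proxies it detects from the environment. This is a small sketch for Python 3; it sets the variables inside the process only for demonstration (normally they would come from your shell), and the address is the same local example as above.

```python
import os
import urllib.request

# Normally exported in the shell; set here only so the example is self-contained.
os.environ['http_proxy'] = 'http://127.0.0.1:3129/'
os.environ['https_proxy'] = 'http://127.0.0.1:3129/'

# Returns a dict mapping scheme names to the proxy URLs found in *_proxy variables.
detected = urllib.request.getproxies_environment()
print(detected.get('http'), detected.get('https'))
```

If the values printed here are missing or wrong, the automatic detection mentioned in the question has nothing to work with, so this is a quick first thing to verify.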

Sibi answered Oct 05 '22 21:10