
NLTK 3 POS_TAG throws UnicodeDecodeError

Hi, I am trying to learn NLTK, and I am new to Python as well. I am trying the following:

>>> import nltk
>>> nltk.pos_tag(nltk.word_tokenize("John lived in China"))

I get the following error message

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    nltk.pos_tag(nltk.word_tokenize("John lived in California"))
  File "C:\Python34\lib\site-packages\nltk\tag\__init__.py", line 100, in pos_tag
    tagger = load(_POS_TAGGER)
  File "C:\Python34\lib\site-packages\nltk\data.py", line 779, in load
    resource_val = pickle.load(opened_resource)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)

I have downloaded all models available (including the maxent_treebank_pos_tagger)

The default system encoding is UTF-8:

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'

I opened up data.py and this is the relevant content:

774        # Load the resource.
775        opened_resource = _open(resource_url)
776        if format == 'raw':
777            resource_val = opened_resource.read()
778        elif format == 'pickle':
779            resource_val = pickle.load(opened_resource)
780        elif format == 'json':
781            import json

What am I doing wrong here?

asked Aug 31 '14 by Niranjan Sonachalam

3 Answers

OK, I found the solution. It looks like a problem in the NLTK source itself. Check here

I opened up data.py and modified line 779 as below:

resource_val = pickle.load(opened_resource) #old
resource_val = pickle.load(opened_resource, encoding='iso-8859-1') #new
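
For context, the same encoding argument applies to any pickle written by Python 2 that contains byte strings; a minimal sketch, where the file name is purely illustrative:

import pickle

# Hypothetical file: a pickle produced under Python 2 that contains 8-bit strings.
# encoding='iso-8859-1' tells pickle.load how to turn those bytes into str,
# instead of failing with the default 'ascii' codec (the UnicodeDecodeError above).
with open("py2_tagger.pickle", "rb") as f:
    tagger = pickle.load(f, encoding="iso-8859-1")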
answered Nov 12 '22 by Niranjan Sonachalam


The fundamental problem is that NLTK 2.x is not supported on Python 3; NLTK 3 is an ongoing effort to release a fully Python 3-compatible version.

The simple workaround is to download the latest NLTK 3.x and use that instead.
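
For example, from a command prompt (assuming pip is available for the Python 3.4 installation):

pip install --upgrade nltk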

If you want to participate in finishing the port to Python 3, you probably need a deeper understanding of the differences between Python 2 and Python 3; in particular, for this case, how the fundamental string type in Python 3 is a Unicode string (str), whereas Python 2's default string is a byte string (what Python 3 writes as b'...'). See also http://nedbatchelder.com/text/unipain.html
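
A quick illustration of that difference in a Python 3 interpreter:

>>> text = "John lived in China"    # str: Unicode text
>>> data = text.encode("utf-8")     # bytes: the encoded representation
>>> data
b'John lived in China'
>>> data.decode("utf-8") == text    # decoding the bytes recovers the str
True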

FWIW, see also https://github.com/nltk/nltk/issues/169#issuecomment-12778108 for a fix identical to yours. The bug you linked to has already been fixed in NLTK 3.0 (presumably by a fix to the actual data files instead; I think in 3.0a3).

answered Nov 12 '22 by tripleee


I'm coming to this late, but in case it helps someone else who comes across this: what worked for me was to decode the text before passing it to word_tokenize, i.e.:

import nltk

raw_text = b"John lived in China"         # bytes, e.g. text read from a file in binary mode
to_tokenize = raw_text.decode('utf-8')    # decode the bytes to str before tokenizing
tokenized = nltk.word_tokenize(to_tokenize)
output = nltk.pos_tag(tokenized)

Maybe that'll work for someone else!

answered Nov 12 '22 by zSand