
NLTK 3 POS_TAG throws UnicodeDecodeError

Hi, I am trying to learn NLTK, and I am new to Python as well. I am trying the following:

>>> import nltk
>>> nltk.pos_tag(nltk.word_tokenize("John lived in China"))

I get the following error message

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    nltk.pos_tag(nltk.word_tokenize("John lived in California"))
  File "C:\Python34\lib\site-packages\nltk\tag\__init__.py", line 100, in pos_tag
    tagger = load(_POS_TAGGER)
  File "C:\Python34\lib\site-packages\nltk\data.py", line 779, in load
    resource_val = pickle.load(opened_resource)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)

I have downloaded all models available (including the maxent_treebank_pos_tagger)

The default system encoding is UTF-8:

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'

I opened up data.py and this is the relevant content:

774        # Load the resource.
775        opened_resource = _open(resource_url)
776        if format == 'raw':
777            resource_val = opened_resource.read()
778        elif format == 'pickle':
779            resource_val = pickle.load(opened_resource)
780        elif format == 'json':
781            import json

What am I doing wrong here?

asked Aug 31 '14 by Niranjan Sonachalam

3 Answers

OK, I found the solution. It looks like a problem in the NLTK source itself. Check here

I opened up data.py and modified line 779 as below:

resource_val = pickle.load(opened_resource) #old
resource_val = pickle.load(opened_resource, encoding='iso-8859-1') #new
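
For context, the same encoding argument applies to any pickle written by Python 2 that contains byte strings; a minimal sketch, where the file name is purely illustrative:

import pickle

# Hypothetical file: a pickle produced under Python 2 that contains 8-bit strings.
# encoding='iso-8859-1' tells pickle.load how to turn those bytes into str,
# instead of failing with the default 'ascii' codec (the UnicodeDecodeError above).
with open("py2_tagger.pickle", "rb") as f:
    tagger = pickle.load(f, encoding="iso-8859-1")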
answered Nov 12 '22 by Niranjan Sonachalam


The fundamental problem is that NLTK 2.x is not supported on Python 3; NLTK 3 is an ongoing effort to release a fully Python 3-compatible version.

The simple workaround is to download the latest NLTK 3.x and use that instead.
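
For example, from a command prompt (assuming pip is available for the Python 3.4 installation):

pip install --upgrade nltk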

If you want to participate in finishing the port to Python 3, you probably need a deeper understanding of the differences between Python 2 and Python 3; in particular, for this case, how the fundamental string type in Python 3 is a Unicode string (str), whereas Python 2's default string is a byte string (what Python 3 writes as b'...'). See also http://nedbatchelder.com/text/unipain.html
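
A quick illustration of that difference in a Python 3 interpreter:

>>> text = "John lived in China"    # str: Unicode text
>>> data = text.encode("utf-8")     # bytes: the encoded representation
>>> data
b'John lived in China'
>>> data.decode("utf-8") == text    # decoding the bytes recovers the str
True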

FWIW, see also https://github.com/nltk/nltk/issues/169#issuecomment-12778108 for a fix identical to yours. The bug you linked to has already been fixed in NLTK 3.0 (presumably by a fix to the actual data files instead; I think in 3.0a3).

answered Nov 12 '22 by tripleee


I'm coming to this late, but in case it helps someone else who comes across this: what worked for me was to decode the text before passing it to word_tokenize, i.e.:

import nltk

raw_text = b"John lived in China"         # bytes, e.g. text read from a file in binary mode
to_tokenize = raw_text.decode('utf-8')    # decode the bytes to str before tokenizing
tokenized = nltk.word_tokenize(to_tokenize)
output = nltk.pos_tag(tokenized)

Maybe that'll work for someone else!

answered Nov 12 '22 by zSand