Hi, I am trying to learn NLTK. I am also new to Python. I am trying the following:
>>> import nltk
>>> nltk.pos_tag(nltk.word_tokenize("John lived in China"))
I get the following error message:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    nltk.pos_tag(nltk.word_tokenize("John lived in California"))
  File "C:\Python34\lib\site-packages\nltk\tag\__init__.py", line 100, in pos_tag
    tagger = load(_POS_TAGGER)
  File "C:\Python34\lib\site-packages\nltk\data.py", line 779, in load
    resource_val = pickle.load(opened_resource)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)
I have downloaded all models available (including the maxent_treebank_pos_tagger)
The default system encoding is UTF-8:

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
I opened the data.py file and this is the content around line 779:
774        # Load the resource.
775        opened_resource = _open(resource_url)
776        if format == 'raw':
777            resource_val = opened_resource.read()
778        elif format == 'pickle':
779            resource_val = pickle.load(opened_resource)
780        elif format == 'json':
781            import json
What am I doing wrong here?
OK, I found the solution. It looks like a problem in the source itself. Check here
I opened data.py and modified line 779 as below:
resource_val = pickle.load(opened_resource) #old
resource_val = pickle.load(opened_resource, encoding='iso-8859-1') #new
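To see why the encoding argument matters, here is a minimal, self-contained sketch (the byte string below is a hypothetical stand-in for the tagger pickle, not real NLTK data): a pickle written by Python 2 for a byte string containing 0xcb fails to load under Python 3's default ASCII decoding, but loads once an 8-bit encoding is supplied, exactly like the data.py patch.

```python
import pickle

# Byte-for-byte what Python 2 writes for pickle.dumps('\xcb', protocol=2):
# a SHORT_BINSTRING opcode (U) carrying the single non-ASCII byte 0xcb.
py2_pickle = b'\x80\x02U\x01\xcbq\x00.'

# Python 3's default reproduces the NLTK error:
try:
    pickle.loads(py2_pickle)
except UnicodeDecodeError as exc:
    print(exc)  # 'ascii' codec can't decode byte 0xcb in position 0 ...

# Supplying an 8-bit encoding, as in the modified line 779, succeeds:
value = pickle.loads(py2_pickle, encoding='iso-8859-1')
print(repr(value))  # 'Ë' (0xcb decoded as ISO-8859-1)
```

Passing encoding='bytes' would instead return the raw b'\xcb', which is sometimes the safer choice when the pickled data is not really Latin-1 text.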
The fundamental problem is that NLTK 2.x is not supported on Python 3, and NLTK 3 is an ongoing effort to release a fully Python 3-compatible version.
The simple workaround is to download the latest NLTK 3.x and use that instead.
If you want to participate in finishing the port to Python 3, you probably need a deeper understanding of the differences between Python 2 and Python 3; in particular, for this case, how the fundamental string type in Python 3 is a Unicode string ('...'), not a byte string (b'...' in Python 3) as in Python 2. See also http://nedbatchelder.com/text/unipain.html
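As a quick illustration of that difference (plain Python, nothing NLTK-specific):

```python
# In Python 3, str is Unicode text and bytes is a separate binary type.
text = 'Ë'                    # str: one Unicode code point (U+00CB)
data = text.encode('utf-8')   # bytes: its UTF-8 form, b'\xc3\x8b'
assert isinstance(text, str) and isinstance(data, bytes)

# Decoding bytes with the wrong codec raises the same kind of error
# seen in the traceback above:
try:
    b'\xcb'.decode('ascii')
except UnicodeDecodeError as exc:
    print(exc)  # 'ascii' codec can't decode byte 0xcb in position 0 ...
```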
FWIW, see also https://github.com/nltk/nltk/issues/169#issuecomment-12778108 for a fix identical to yours. The bug you linked to has already been fixed in NLTK 3.0 (presumably by a fix to the actual data files instead; I think in 3.0a3).
I'm coming to this late, but in case it helps someone else who comes across this, what worked for me was to decode the text before passing it to word_tokenize (note that str.decode only exists in Python 2; in Python 3 you would decode a bytes object instead):

raw_text = "John lived in China"
to_tokenize = raw_text.decode('utf-8')  # Python 2: byte string -> unicode
tokenized = nltk.word_tokenize(to_tokenize)
output = nltk.pos_tag(tokenized)
Maybe that'll work for someone else!