 

Tokenizing unicode using nltk


I have text files in UTF-8 encoding that contain characters like 'ö', 'ü', etc. I would like to parse the text from these files, but I can't get the tokenizer to work properly. If I use the standard nltk tokenizer:

import nltk

# Raw string so that '\t' in the path is not read as a tab character.
f = open(r'C:\Python26\text.txt', 'r') # text = 'müsli pöök rääk'
text = f.read()
f.close()
items = text.decode('utf8')
a = nltk.word_tokenize(items)

Output: [u'\ufeff', u'm', u'\xfc', u'sli', u'p', u'\xf6', u'\xf6', u'k', u'r', u'\xe4', u'\xe4', u'k']

The Punkt tokenizer seems to do better:

# Raw string so that '\t' in the path is not read as a tab character.
f = open(r'C:\Python26\text.txt', 'r') # text = 'müsli pöök rääk'
text = f.read()
f.close()
items = text.decode('utf8')
a = PunktWordTokenizer().tokenize(items)

Output: [u'\ufeffm\xfcsli', u'p\xf6\xf6k', u'r\xe4\xe4k']

There is still a '\ufeff' before the first token that I can't explain (I could remove it by hand, but I would like to understand where it comes from). What am I doing wrong? Help greatly appreciated.

asked Feb 10 '12 13:02 by root




2 Answers

It's more likely that the \uFEFF character is part of the content read from the file; I doubt it was inserted by the tokeniser. \uFEFF at the beginning of a file is a Byte Order Mark (BOM), which the Unicode standard neither requires nor recommends for UTF-8. If it appears anywhere else, it is treated as a zero width no-break space, and that use is the deprecated one.
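If the text has already been decoded and only the stray character needs to go, a leading U+FEFF can be stripped directly. This is a minimal sketch, assuming Python 3; `strip_leading_bom` is a hypothetical helper name, not part of nltk or the standard library:

```python
def strip_leading_bom(text):
    """Return text without a single leading U+FEFF, if one is present."""
    # Only the first character is checked: U+FEFF elsewhere in the text
    # is not a byte order mark and is left alone.
    if text.startswith(u"\ufeff"):
        return text[1:]
    return text


# The string the question got back from .decode('utf8').
decoded = u"\ufeffm\xfcsli p\xf6\xf6k r\xe4\xe4k"
cleaned = strip_leading_bom(decoded)
```

This treats the symptom only; reading the file with an encoding that understands the BOM avoids the problem at the source.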

Was the file written by Microsoft Notepad? From the codecs module docs:

To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.
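The byte sequence named in that passage can be checked directly, since the standard library exposes it as `codecs.BOM_UTF8`. A small sketch, assuming Python 3, with a made-up byte string standing in for the file contents:

```python
import codecs

# The UTF-8 encoded BOM is exactly the three bytes quoted above.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# Stand-in for the raw bytes of a file saved with a BOM.
raw = codecs.BOM_UTF8 + u"m\xfcsli".encode("utf-8")

with_bom = raw.decode("utf-8")         # plain utf-8 keeps the BOM as U+FEFF
without_bom = raw.decode("utf-8-sig")  # utf-8-sig drops a leading BOM
```

Decoding with "utf-8-sig" also accepts input that has no BOM, so it should be a safe default when the origin of a file is unknown.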

Try reading your file using codecs.open() instead. Note the "utf-8-sig" encoding which consumes the BOM.

import codecs
import nltk

# Raw string so that '\t' in the path is not read as a tab character.
f = codecs.open(r'C:\Python26\text.txt', 'r', 'utf-8-sig')
text = f.read()
f.close()
a = nltk.word_tokenize(text)

Experiment:

>>> open("x.txt", "r").read().decode("utf-8")
u'\ufeffm\xfcsli'
>>> import codecs
>>> codecs.open("x.txt", "r", "utf-8-sig").read()
u'm\xfcsli'
>>> 
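The same read can be written with `io.open` (which is the built-in `open` on Python 3), since it takes an `encoding` argument. The sketch below assumes Python 3 and is self-contained: the temporary file is only a stand-in for the asker's text.txt and is written with a BOM the way Notepad would write it.

```python
import io
import os
import tempfile

# Create a throwaway file: BOM first, then the UTF-8 encoded text.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xef\xbb\xbf" + u"m\xfcsli p\xf6\xf6k".encode("utf-8"))

try:
    # The with-block closes the file; utf-8-sig consumes the BOM on read.
    with io.open(path, "r", encoding="utf-8-sig") as f:
        text = f.read()
finally:
    os.remove(path)
```

Here `text` holds decoded text with no leading U+FEFF, so it can be passed straight to a tokenizer.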
answered Sep 22 '22 02:09 by Shawn Chin


You should make sure that you're passing unicode strings to nltk tokenizers. I get the following identical tokenizations of your string with both tokenizers on my end:

import nltk
nltk.wordpunct_tokenize('müsli pöök rääk'.decode('utf8'))
# output : [u'm\xfcsli', u'p\xf6\xf6k', u'r\xe4\xe4k']

nltk.word_tokenize('müsli pöök rääk'.decode('utf8'))
# output: [u'm\xfcsli', u'p\xf6\xf6k', u'r\xe4\xe4k']
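
As a rough illustration of why the string type matters, here is a stand-in that uses the standard `re` module rather than nltk itself. The pattern is chosen for illustration only and is not claimed to be what nltk uses internally. The behaviour shown assumes Python 3, where `\w` matches Unicode letters on decoded text but only ASCII on bytes:

```python
import re

text = u"m\xfcsli p\xf6\xf6k r\xe4\xe4k"  # 'müsli pöök rääk' as decoded text
raw = text.encode("utf-8")                # the same words as UTF-8 bytes

# On decoded text, \w matches Unicode letters, so words stay whole.
text_tokens = re.findall(r"\w+|[^\w\s]+", text)

# On bytes, \w is ASCII-only, so the two bytes of 'ü' split 'müsli' apart.
byte_tokens = re.findall(br"\w+|[^\w\s]+", raw)
```

The fragmented byte tokens resemble the broken output in the question, which is one reason to decode first and tokenize second.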
answered Sep 20 '22 02:09 by Darius Braziunas