Tokenizing unicode using nltk

I have text files in UTF-8 encoding that contain characters like 'ö', 'ü', etc. I would like to parse the text from these files, but I can't get the tokenizer to work properly. If I use the standard NLTK tokenizer:

import nltk

# Raw string so that '\t' in the path is not read as a tab character
f = open(r'C:\Python26\text.txt', 'r') # text = 'müsli pöök rääk'
text = f.read()
f.close()
items = text.decode('utf8')
a = nltk.word_tokenize(items)

Output: [u'\ufeff', u'm', u'\xfc', u'sli', u'p', u'\xf6', u'\xf6', u'k', u'r', u'\xe4', u'\xe4', u'k']

The Punkt tokenizer seems to do better:

# Raw string so that '\t' in the path is not read as a tab character
f = open(r'C:\Python26\text.txt', 'r') # text = 'müsli pöök rääk'
text = f.read()
f.close()
items = text.decode('utf8')
a = PunktWordTokenizer().tokenize(items)

Output: [u'\ufeffm\xfcsli', u'p\xf6\xf6k', u'r\xe4\xe4k']

There is still a '\ufeff' before the first token. I could strip it manually, but I can't figure out where it comes from. What am I doing wrong? Help greatly appreciated.

asked Feb 10 '12 13:02

root

2 Answers

The \uFEFF character is most likely part of the content read from the file; I doubt the tokenizer inserted it. At the beginning of a file, \uFEFF is a Byte Order Mark (BOM). Anywhere else, it is treated as a zero width no-break space, a use that Unicode has since deprecated.

Was the file written by Microsoft Notepad? From the codecs module docs:

To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.
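The byte sequence mentioned in the quote can be checked with the standard library alone. The following is a minimal sketch, written for Python 3 and not taken from the original answer; the sample word is borrowed from the question, and the variable names are only illustrative.

```python
import codecs

# codecs.BOM_UTF8 holds the UTF-8 encoded BOM: the bytes 0xEF 0xBB 0xBF.
bom = codecs.BOM_UTF8
bom_hex = [hex(b) for b in bytearray(bom)]

# Simulate the raw contents of a file saved with a leading BOM.
raw = bom + u'm\xfcsli'.encode('utf-8')

# Plain 'utf-8' keeps the BOM as the character U+FEFF ...
plain = raw.decode('utf-8')
# ... while 'utf-8-sig' consumes it during decoding.
stripped = raw.decode('utf-8-sig')

print(bom_hex)
print(repr(plain))
print(repr(stripped))
```

The only difference between the two decoded strings is the leading U+FEFF, which is the extra character seen in the question's output.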

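Try reading your file using codecs.open() instead. Note the "utf-8-sig" encoding, which consumes the BOM.

import codecs
import nltk

# Raw string so that '\t' in the path is not read as a tab character
f = codecs.open(r'C:\Python26\text.txt', 'r', 'utf-8-sig')
text = f.read()  # already decoded to unicode, without the BOM
f.close()
a = nltk.word_tokenize(text)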

Experiment:

>>> open("x.txt", "r").read().decode("utf-8")
u'\ufeffm\xfcsli'
>>> import codecs
>>> codecs.open("x.txt", "r", "utf-8-sig").read()
u'm\xfcsli'
>>>
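As a possible follow-up to the experiment, the sketch below wraps the same idea in a small helper and checks that "utf-8-sig" gives the same text whether or not the file starts with a BOM. It is written for Python 3 and uses io.open rather than codecs.open; read_text and _write_bytes are hypothetical names introduced for this illustration, and the files are throwaway temporary files rather than the asker's text.txt.

```python
import io
import os
import tempfile


def read_text(path):
    """Hypothetical helper: read a UTF-8 file, dropping a leading BOM if any."""
    with io.open(path, 'r', encoding='utf-8-sig') as f:
        return f.read()


def _write_bytes(data):
    """Write raw bytes to a new temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    with os.fdopen(fd, 'wb') as f:
        f.write(data)
    return path


# Same sample text as in the question: 'müsli pöök rääk'
body = u'm\xfcsli p\xf6\xf6k r\xe4\xe4k'.encode('utf-8')

with_bom = _write_bytes(b'\xef\xbb\xbf' + body)  # Notepad-style file
without_bom = _write_bytes(body)                 # plain UTF-8 file

try:
    text_a = read_text(with_bom)
    text_b = read_text(without_bom)
finally:
    os.remove(with_bom)
    os.remove(without_bom)

print(text_a == text_b)
print(text_a.split())
```

Because "utf-8-sig" only skips a BOM when one is present, the helper should be safe to use on files from mixed sources.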

answered Sep 22 '22 02:09

Shawn Chin

You should make sure that you're passing Unicode strings to the NLTK tokenizers. With both of the tokenizers below, I get identical tokenizations of your string on my end:

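# -*- coding: utf-8 -*-
# (Python 2 needs the encoding declaration above for non-ASCII literals in a source file)
import nltk

# Decode the UTF-8 byte string to unicode before tokenizing
nltk.wordpunct_tokenize('müsli pöök rääk'.decode('utf8'))
# output: [u'm\xfcsli', u'p\xf6\xf6k', u'r\xe4\xe4k']

nltk.word_tokenize('müsli pöök rääk'.decode('utf8'))
# output: [u'm\xfcsli', u'p\xf6\xf6k', u'r\xe4\xe4k']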
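To illustrate why decoded text matters, here is a rough sketch that uses the standard re module as a stand-in tokenizer. This is not how NLTK tokenizes; it only shows, under that assumption, how a pattern applied to raw UTF-8 bytes breaks words at every non-ASCII character, while the same pattern applied to decoded text keeps them whole. It is written for Python 3.

```python
import re

# Same sample text as in the question: 'müsli pöök rääk'
text = u'm\xfcsli p\xf6\xf6k r\xe4\xe4k'
raw = text.encode('utf-8')

# On bytes, \w matches ASCII word characters only, so the bytes that
# encode 'ü', 'ö' and 'ä' act as separators and split the words.
byte_tokens = re.findall(br'\w+', raw)

# On decoded text, \w is Unicode-aware and the words stay intact.
text_tokens = re.findall(r'\w+', text, re.UNICODE)

print(byte_tokens)
print(text_tokens)
```

The fragmented byte tokens resemble the split output shown in the question, although the cause there may differ depending on the NLTK version in use.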

answered Sep 20 '22 02:09

Darius Braziunas
