I am trying to extract words from a German document. When I use the following method as described in the NLTK tutorial, I fail to get the words with language-specific special characters.
ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*')
words = nltk.Text(ptcr.words(DocumentName))
What should I do to get the list of words in the document?
An example with nltk.tokenize.WordPunctTokenizer() for the German phrase "Veränderungen über einen Walzer" looks like:
In [231]: nltk.tokenize.WordPunctTokenizer().tokenize(u"Veränderungen über einen Walzer")
Out[231]: [u'Ver\xc3', u'\xa4', u'nderungen', u'\xc3\xbcber', u'einen', u'Walzer']
In this example "ä" is treated as a delimiter, while "ü" is not.
Both spaCy and NLTK support English, German, French, Spanish, Portuguese, Italian, Dutch, and Greek.
NLTK is a string processing library: it takes strings as input and returns strings or lists of strings as output. spaCy, on the other hand, takes an object-oriented approach: when you parse a text, spaCy returns a Doc object whose words and sentences are objects themselves.
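A minimal sketch of that API difference (the spaCy model name de_core_news_sm and the use of a recent NLTK/spaCy install are my assumptions, not part of the original answers):
import nltk
import spacy

text = u"Veränderungen über einen Walzer"

# NLTK: strings in, list of strings out
print(nltk.tokenize.WordPunctTokenizer().tokenize(text))

# spaCy: parsing returns a Doc object; its tokens are objects themselves
nlp = spacy.load('de_core_news_sm')   # assumes the small German model is installed
doc = nlp(text)
print([token.text for token in doc])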
Call PlaintextCorpusReader with the parameter encoding='utf-8':
ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')
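A quick way to check that this works (Corpus and DocumentName are placeholders here, just as in the question):
import nltk
ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')
# The tokens now come back as unicode objects, so "ä" and "ü" survive:
print u' '.join(ptcr.words(DocumentName)[:10])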
Edit: I see... you have two separate problems here:
First, a tokenization problem: when you test with a German string literal, you think you are entering unicode. In fact you are telling Python to take the bytes between the quotes and convert them into a unicode string, and those bytes are being misinterpreted. Fix: add the following line at the very top of your source file.
# -*- coding: utf-8 -*-
All of a sudden your constants will be seen and tokenized correctly:
german = u"Veränderungen über einen Walzer"
print nltk.tokenize.WordPunctTokenizer().tokenize(german)
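With the encoding declaration in place, this should print something like:
[u'Ver\xe4nderungen', u'\xfcber', u'einen', u'Walzer']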
Second problem: It turns out that Text() does not use unicode! If you pass it a unicode string, it will try to convert it to a pure-ascii string, which of course fails on non-ascii input. Ugh.
Solution: My recommendation would be to avoid using nltk.Text entirely, and work with the corpus readers directly. (This is in general a good idea: see nltk.Text's own documentation.)
But if you must use nltk.Text with German data, here's how: read your data properly so it can be tokenized, but then "encode" your unicode back to a list of str. For German, it's probably safest to just use the Latin-1 encoding, but utf-8 seems to work too.
ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')
# Convert unicode to utf8-encoded str
coded = [ tok.encode('utf-8') for tok in ptcr.words(DocumentName) ]
words = nltk.Text(coded)
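Once wrapped this way, the usual nltk.Text methods work on the byte-encoded tokens; for example ("Walzer" is just an illustrative query word, not taken from the original answer):
words.concordance("Walzer")
words.collocations()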
Take a look at http://text-processing.com/demo/tokenize/. I'm not sure your text is getting the right encoding, since WordPunctTokenizer in the demo handles the words fine, and so does PunktWordTokenizer.
You might try a simple regular expression. The following suffices if you want just the words; it will swallow all punctuation:
>>> import re
>>> re.findall("\w+", "Veränderungen über einen Walzer.".decode("utf-8"), re.U)
[u'Ver\xe4nderungen', u'\xfcber', u'einen', u'Walzer']
Note that re.U changes the meaning of \w in the RE: with that flag, \w matches word characters according to the Unicode character database rather than just ASCII, which is what lets it pick up "ä" and "ü". My locale is set to en_US.UTF-8, which is apparently good enough for your example.
Also note that "Veränderungen über einen Walzer".decode("utf-8") and u"Veränderungen über einen Walzer" are not necessarily the same string: the first explicitly decodes the bytes as utf-8, while the second depends on your source-file encoding declaration.
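If you prefer, the same result with a precompiled raw-string pattern (just a minor stylistic variation on the above, not a different technique):
>>> import re
>>> pattern = re.compile(r"\w+", re.UNICODE)
>>> pattern.findall(u"Veränderungen über einen Walzer.")
[u'Ver\xe4nderungen', u'\xfcber', u'einen', u'Walzer']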