Extracting Words from German Text with NLTK

I am trying to extract words from a German document. When I use the following method as described in the NLTK tutorial, I fail to get the words with language-specific special characters.

ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*')
words = nltk.Text(ptcr.words(DocumentName))

What should I do to get the list of words in the document?

An example with nltk.tokenize.WordPunctTokenizer() for the German phrase "Veränderungen über einen Walzer" looks like:

In [231]: nltk.tokenize.WordPunctTokenizer().tokenize(u"Veränderungen über einen Walzer")

Out[231]: [u'Ver\xc3', u'\xa4', u'nderungen', u'\xc3\xbcber', u'einen', u'Walzer']

In this example "ä" is treated as a delimiter, even though "ü" is not.

asked Feb 05 '12 by red



3 Answers

Call PlaintextCorpusReader with the parameter encoding='utf-8':

ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')
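
With the encoding set, the reader decodes the file before tokenizing, so the tokens come back as proper unicode objects. A quick sanity check, reusing the Corpus and DocumentName placeholders from your question:

import nltk

# Corpus is your corpus directory, DocumentName the file within it
ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')
for tok in ptcr.words(DocumentName)[:10]:
    print repr(tok)  # should show u'Ver\xe4nderungen'-style unicode tokens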

Edit: I see... you have two separate problems here:

First problem (tokenization): When you test with a literal German string, you think you are entering unicode. In fact you are telling Python to take the bytes between the quotes and convert them into a unicode string, but your bytes are being misinterpreted. Fix: add the following line at the very top of your source file.

# -*- coding: utf-8 -*-

All of a sudden your constants will be seen and tokenized correctly:

german = u"Veränderungen über einen Walzer"
print nltk.tokenize.WordPunctTokenizer().tokenize(german)
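
With the coding line in place, the phrase should tokenize cleanly (my reconstruction of the output, shown as Python 2 reprs):

[u'Ver\xe4nderungen', u'\xfcber', u'einen', u'Walzer']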

Second problem: It turns out that Text() does not use unicode! If you pass it a unicode string, it will try to convert it to a pure-ascii string, which of course fails on non-ascii input. Ugh.

Solution: My recommendation would be to avoid using nltk.Text entirely, and work with the corpus readers directly. (This is in general a good idea; see nltk.Text's own documentation.)

But if you must use nltk.Text with German data, here's how: Read your data properly so it can be tokenized, but then "encode" your unicode back to a list of str. For German, it's probably safest to just use the Latin-1 encoding, but utf-8 seems to work too.

ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')

# Convert unicode to utf-8-encoded str
coded = [tok.encode('utf-8') for tok in ptcr.words(DocumentName)]
words = nltk.Text(coded)
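
For what it's worth, the usual nltk.Text methods then work on these byte-string tokens, as long as any query terms you pass in are encoded the same way. A small sketch under that assumption:

words.concordance('Walzer')  # plain ASCII queries work as-is
words.concordance(u'Veränderungen'.encode('utf-8'))  # encode non-ASCII queries to match the tokens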
answered Oct 19 '22 by alexis


Take a look at http://text-processing.com/demo/tokenize/. I'm not sure your text is getting the right encoding, since WordPunctTokenizer in the demo handles the words fine, and so does PunktWordTokenizer.

answered Oct 19 '22 by Jacob


You might try a simple regular expression. The following suffices if you want just the words; it will swallow all punctuation:

>>> import re
>>> re.findall(r"\w+", "Veränderungen über einen Walzer.".decode("utf-8"), re.U)
[u'Ver\xe4nderungen', u'\xfcber', u'einen', u'Walzer']

Note that re.U changes the meaning of \w in the RE so that it matches Unicode word characters instead of just ASCII ones; that is what keeps "ä" and "ü" inside the words. (It is re.L, not re.U, that depends on the current locale.)
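
To see what the flag buys you, drop it: with the default ASCII-only \w, the umlauts act as delimiters again (output reconstructed; this is the standard Python 2 behavior):

>>> re.findall(r"\w+", "Veränderungen über einen Walzer.".decode("utf-8"))
[u'Ver', u'nderungen', u'ber', u'einen', u'Walzer']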

Also note that "Veränderungen über einen Walzer".decode("utf-8") and u"Veränderungen über einen Walzer" are different strings.
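
A minimal Python 2 sketch of that difference, simulating the mis-decoded literal with latin-1 (an assumption about the broken environment: each UTF-8 byte then becomes its own code point):

>>> raw = "Veränderungen"     # UTF-8 bytes in a UTF-8 terminal
>>> raw.decode("latin-1")     # roughly what the broken u"..." literal produced
u'Ver\xc3\xa4nderungen'
>>> raw.decode("utf-8")       # the correctly decoded string
u'Ver\xe4nderungen'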

answered Oct 19 '22 by Fred Foo