I am trying to extract words from a German document. When I use the following method as described in the NLTK tutorial, I fail to get the words with language-specific special characters.
ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*')
words = nltk.Text(ptcr.words(DocumentName))
What should I do to get the list of words in the document?
An example with nltk.tokenize.WordPunctTokenizer() for the German phrase "Veränderungen über einen Walzer" looks like:
In [231]: nltk.tokenize.WordPunctTokenizer().tokenize(u"Veränderungen über einen Walzer")
Out[231]: [u'Ver\xc3', u'\xa4', u'nderungen', u'\xc3\xbcber', u'einen', u'Walzer']
In this example "ä" is treated as a delimiter, while "ü" is not.
Both spaCy and NLTK support English, German, French, Spanish, Portuguese, Italian, Dutch, and Greek.
NLTK is a string processing library: it takes strings as input and returns strings or lists of strings as output. spaCy, on the other hand, takes an object-oriented approach: when you parse a text, spaCy returns a Doc object whose words and sentences are objects themselves.
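A minimal sketch of that API difference (the spaCy model name de_core_news_sm and the use of a recent NLTK/spaCy install are my assumptions, not part of the original answers):
import nltk
import spacy

text = u"Veränderungen über einen Walzer"

# NLTK: strings in, list of strings out
print(nltk.tokenize.WordPunctTokenizer().tokenize(text))

# spaCy: parsing returns a Doc object; its tokens are objects themselves
nlp = spacy.load('de_core_news_sm')   # assumes the small German model is installed
doc = nlp(text)
print([token.text for token in doc])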
Call PlaintextCorpusReader with the parameter encoding='utf-8':
ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')
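A quick way to check that this works (Corpus and DocumentName are placeholders here, just as in the question):
import nltk
ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')
# The tokens now come back as unicode objects, so "ä" and "ü" survive:
print u' '.join(ptcr.words(DocumentName)[:10])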
Edit: I see... you have two separate problems here:
First, a tokenization problem: when you test with a German string literal, you think you are entering unicode. In fact you are telling Python to take the bytes between the quotes and convert them into a unicode string, and those bytes are being misinterpreted. Fix: add the following line at the very top of your source file.
# -*- coding: utf-8 -*-
All of a sudden your constants will be seen and tokenized correctly:
german = u"Veränderungen über einen Walzer"
print nltk.tokenize.WordPunctTokenizer().tokenize(german)
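With the encoding declaration in place, this should print something like:
[u'Ver\xe4nderungen', u'\xfcber', u'einen', u'Walzer']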
Second problem: It turns out that Text() does not use unicode! If you pass it a unicode string, it will try to convert it to a pure-ascii string, which of course fails on non-ascii input. Ugh.
Solution: My recommendation would be to avoid using nltk.Text entirely, and work with the corpus readers directly. (This is in general a good idea: see nltk.Text's own documentation.)
But if you must use nltk.Text with German data, here's how: read your data properly so it can be tokenized, but then "encode" your unicode back to a list of str. For German, it's probably safest to just use the Latin-1 encoding, but utf-8 seems to work too.
ptcr = nltk.corpus.PlaintextCorpusReader(Corpus, '.*', encoding='utf-8')
# Convert unicode to utf8-encoded str
coded = [ tok.encode('utf-8') for tok in ptcr.words(DocumentName) ]
words = nltk.Text(coded)
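Once wrapped this way, the usual nltk.Text methods work on the byte-encoded tokens; for example ("Walzer" is just an illustrative query word, not taken from the original answer):
words.concordance("Walzer")
words.collocations()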
Take a look at http://text-processing.com/demo/tokenize/. I'm not sure your text is getting the right encoding, since WordPunctTokenizer in the demo handles the words fine, and so does PunktWordTokenizer.
You might try a simple regular expression. The following suffices if you want just the words; it will swallow all punctuation:
>>> import re
>>> re.findall("\w+", "Veränderungen über einen Walzer.".decode("utf-8"), re.U)
[u'Ver\xe4nderungen', u'\xfcber', u'einen', u'Walzer']
Note that re.U changes the meaning of \w in the RE: with that flag, \w matches word characters according to the Unicode character database rather than just ASCII, which is what lets it pick up "ä" and "ü". My locale is set to en_US.UTF-8, which is apparently good enough for your example.
Also note that "Veränderungen über einen Walzer".decode("utf-8") and u"Veränderungen über einen Walzer" are not necessarily the same string: the first explicitly decodes the bytes as utf-8, while the second depends on your source-file encoding declaration.
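If you prefer, the same result with a precompiled raw-string pattern (just a minor stylistic variation on the above, not a different technique):
>>> import re
>>> pattern = re.compile(r"\w+", re.UNICODE)
>>> pattern.findall(u"Veränderungen über einen Walzer.")
[u'Ver\xe4nderungen', u'\xfcber', u'einen', u'Walzer']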