I am trying to extract named entities from Dutch text. I used nltk-trainer to train a tagger and a chunker on the conll2002 Dutch corpus. However, the chunker's parse method does not detect any named entities. Here is my code:
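import nltk

text = 'Christiane heeft een lam.'  # renamed from str to avoid shadowing the built-in
tagger = nltk.data.load('taggers/dutch.pickle')
chunker = nltk.data.load('chunkers/dutch.pickle')
# Tag each token with its part of speech, then chunk the tagged tokens
tagged = tagger.tag(nltk.word_tokenize(text))
print(tagged)
chunks = chunker.parse(tagged)
print(chunks)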
And the output of this program:
[('Christiane', u'N'), ('heeft', u'V'), ('een', u'Art'), ('lam', u'Adj'), ('.', u'Punc')]
(S Christiane/N heeft/V een/Art lam/Adj ./Punc)
I was expecting Christiane to be detected as a named entity. Any help?
To perform named entity recognition with NLTK, you perform three steps: (1) convert your text to tokens with the word_tokenize() function; (2) find the part-of-speech tag for each word with the pos_tag() function; (3) pass the resulting list of (word, POS tag) tuples to the ne_chunk() function.
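The three steps above can be sketched as follows. This is a minimal illustration, not a fix for the Dutch case: as far as I know, the model bundled with ne_chunk() is trained on English, so it is shown here on English input. The helper names english_ner and extract_entities are made up for this example, and the NLTK data packages for the tokenizer, tagger and NE chunker (their names vary between NLTK versions) are assumed to be downloaded already.

```python
import nltk
from nltk.tree import Tree


def english_ner(text):
    """Run the three-step pipeline: tokenize, POS-tag, NE-chunk."""
    tokens = nltk.word_tokenize(text)   # step 1: text -> tokens
    tagged = nltk.pos_tag(tokens)       # step 2: tokens -> (word, POS) tuples
    return nltk.ne_chunk(tagged)        # step 3: tuples -> chunk tree


def extract_entities(tree):
    """Collect (entity text, label) pairs from the top-level subtrees."""
    entities = []
    for node in tree:
        # Named entities are Tree nodes; ordinary tokens are plain tuples.
        if isinstance(node, Tree):
            words = [word for word, _tag in node.leaves()]
            entities.append((" ".join(words), node.label()))
    return entities
```

Calling extract_entities(english_ner("Some English sentence.")) returns a list of (text, label) pairs. The same extract_entities helper should also work on the tree returned by a custom chunker's parse() method, since it only relies on the Tree structure.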
The nltk.chunk package provides classes and interfaces for identifying non-overlapping linguistic groups (such as base noun phrases) in unrestricted text. This task is called “chunk parsing” or “chunking”, and the identified groups are called “chunks”.
GPE (geo-political entity) is one of the labels on the Tree objects returned by the pre-trained ne_chunk model.
Named entity recognition (NER) is one of the most popular data preprocessing tasks. It involves identifying key information in the text and classifying it into a set of predefined categories. An entity is basically the thing that is consistently talked about or referred to in the text. NER is a subtask of NLP.
The conll2002 corpus has both Spanish and Dutch text, so you should make sure to use the fileids parameter, as in python train_chunker.py conll2002 --fileids ned.train. Training on both Spanish and Dutch will give poor results.
The default algorithm is a Tagger based Chunker, which does not work well on conll2002. Instead, use a classifier based chunker like NaiveBayes, so the full command might look like this (and I've confirmed that the resulting chunker does recognize "Christiane" as a "PER"):
python train_chunker.py conll2002 --fileids ned.train --classifier NaiveBayes --filename ~/nltk_data/chunkers/conll2002_ned_NaiveBayes.pickle
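To check what the retrained chunker actually finds, one option is to flatten its parse tree into (word, POS, IOB) triples with nltk.chunk.tree2conlltags and then group the triples into entities. The grouping step is sketched below in plain Python; the function name iob_to_entities is hypothetical, and the sample triples are hand-written to show the format rather than copied from real chunker output.

```python
def iob_to_entities(triples):
    """Group (word, pos, iob) triples into (entity_text, label) pairs."""
    entities = []
    current_words, current_label = [], None
    for word, _pos, iob in triples:
        # "B-PER" -> ("B", "PER"); a bare "O" -> ("O", "")
        prefix, _, label = iob.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and label != current_label)
        if prefix == "O" or starts_new:
            # Close the entity collected so far, if any.
            if current_words:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
        if prefix in ("B", "I"):
            current_words.append(word)
            current_label = label
    # Flush an entity that runs to the end of the sentence.
    if current_words:
        entities.append((" ".join(current_words), current_label))
    return entities


# Hand-written triples in the shape tree2conlltags produces:
sample = [("Christiane", "N", "B-PER"), ("heeft", "V", "O"),
          ("een", "Art", "O"), ("lam", "Adj", "O"), (".", "Punc", "O")]
print(iob_to_entities(sample))

# With a real chunker (assumes the pickle trained above is loaded as `chunker`):
# from nltk.chunk import tree2conlltags
# print(iob_to_entities(tree2conlltags(chunker.parse(tagged_tokens))))
```

An empty result here means the chunker tagged every token as O, which is the symptom described in the question.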