
NLTK for Named Entity Recognition

I am trying to use the NLTK toolkit to extract place, date and time from text messages. I just installed the toolkit on my machine and wrote this quick snippet to test it out:

import nltk

sentence = "Let's meet tomorrow at 9 pm"
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
print(nltk.ne_chunk(pos_tags, binary=True))

I assumed it would identify the date (tomorrow) and the time (9 pm), but surprisingly it recognized neither. I get the following result when I run the code above:

(S (GPE Let/NNP) 's/POS meet/NN tomorrow/NN at/IN 9/CD pm/NN) 

Can someone help me understand whether I am missing something, or is NLTK just not mature enough to tag times and dates properly? Thanks!

Darth.Vader asked Oct 11 '13 07:10



2 Answers

The default NE chunker in NLTK is a maximum entropy chunker trained on the ACE corpus (http://catalog.ldc.upenn.edu/LDC2005T09). It has not been trained to recognise dates and times, so you need to train your own classifier if you want it to do that.

Have a look at http://mattshomepage.com/articles/2016/May/23/nltk_nec/; the whole process is explained very well there.
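To give a rough idea of what training data for such a classifier tends to look like, here is a minimal sketch that labels tokens with the common IOB scheme (B- begins an entity, I- continues it, O is outside). The helper name to_iob and the DATE and TIME labels are my own assumptions for illustration; they are not part of NLTK or of the ACE label set.

```python
def to_iob(tokens, spans):
    """Label tokens with IOB tags.

    spans is a list of (start, end, label) tuples indexing into tokens,
    with end exclusive. Tokens outside every span are tagged "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the entity
    return list(zip(tokens, tags))


tokens = ["Let", "'s", "meet", "tomorrow", "at", "9", "pm"]
# Hypothetical hand annotation: token 3 is a DATE, tokens 5-6 form a TIME.
spans = [(3, 4, "DATE"), (5, 7, "TIME")]

for token, tag in to_iob(tokens, spans):
    print(token, tag)
```

A chunker trained on many sentences annotated like this could, in principle, learn the DATE and TIME labels that the default model lacks.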

Also, there is a module called timex in nltk_contrib which might help you with your needs. https://github.com/nltk/nltk_contrib/blob/master/nltk_contrib/timex.py
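If you only need a handful of expressions, a rule-based approach may be enough. The following is a minimal sketch using only the standard re module; it is not the timex code, and the patterns, the DATE and TIME labels and the function name find_temporal are assumptions chosen for this example.

```python
import re

# Hypothetical, deliberately small patterns; a real tagger needs many more.
DATE_RE = r"\b(?:today|tomorrow|yesterday|tonight)\b"
TIME_RE = r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b"

# One named group per label, so the match tells us which label fired.
TEMPORAL_RE = re.compile(
    r"(?P<DATE>%s)|(?P<TIME>%s)" % (DATE_RE, TIME_RE), re.IGNORECASE
)


def find_temporal(text):
    """Return (label, matched_text) pairs in order of appearance."""
    return [(m.lastgroup, m.group()) for m in TEMPORAL_RE.finditer(text)]


print(find_temporal("Let's meet tomorrow at 9 pm"))
# [('DATE', 'tomorrow'), ('TIME', '9 pm')]
```

This handles the sentence from the question, but it will miss anything the patterns do not cover (weekday names, "next week", "noon", and so on).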

Viktor Vojnovski answered Sep 19 '22 08:09


Named entity recognition is not an easy problem; do not expect any library to be 100% accurate. You shouldn't draw conclusions about NLTK's performance from a single sentence. Here's another example:

sentence = "I went to New York to meet John Smith"

I get

(S   I/PRP   went/VBD   to/TO   (NE New/NNP York/NNP)   to/TO   meet/VB   (NE John/NNP Smith/NNP)) 

As you can see, NLTK does very well here. However, I couldn't get NLTK to recognise today or tomorrow as temporal expressions. You can try Stanford SUTime, which is part of Stanford CoreNLP. I have used it before and it works quite well (it is in Java, though).

mbatchkarov answered Sep 17 '22 08:09