I'm trying to extract named entities from my text using NLTK. I find that NLTK NER is not very accurate for my purpose and I want to add some more tags of my own as well. I've been trying to find a way to train my own NER, but I don't seem to be able to find the right resources. I have a couple of questions regarding NLTK-
I would really appreciate help in this regard
While NLTK provides access to many algorithms to get something done, spaCy provides the best way to do it. It provides the fastest and most accurate syntactic analysis of any NLP library released to date. It also offers access to larger word vectors that are easier to customize.
There are two main models used to achieve this goal: Ontology-based models and Deep Learning-based models. Ontology-based Named Entity Recognition uses a knowledge-based recognition process that relies on lists of datasets, such as a list of company names for the company category, to make inferences.
EntityRuler() allows you to create your own entities to add to a spaCy pipeline. You start by creating an instance of EntityRuler() and passing it the current pipeline, nlp . You can then call add_patterns() on the instance and pass it a dictionary of the text pattern you'd like to label with an entity.
Are you committed to using NLTK/Python? I ran into the same problems as you, and had much better results using Stanford's named-entity recognizer: http://nlp.stanford.edu/software/CRF-NER.shtml. The process for training the classifier using your own data is very well-documented in the FAQ.
If you really need to use NLTK, I'd hit up the mailing list for some advice from other users: http://groups.google.com/group/nltk-users.
Hope this helps!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With