I am a newbie to NLP and trying to figure out how a Named Entity Recognizer annotates named entities. I am experimenting with Stanford NER toolkit. When I use the NER on standard more formal datasets where all naming conventions are followed to represent named entities such as in newswires or news blogs, the NER annotates the entities correctly. However when I run NER with informal datasets such as twitter, where named entities might not be capitalized as should have been, The NER does not annotate the entities. The classifier that I am using is a 3-CRF serialised classifer. Can anybody let me know how can I make the NER recognize lower case entities too?? Any useful suggestions on how to hack the NER and where this improvement is to be done is greatly appreciated. Thanks in advance for all your help.
MISC is a category from the CoNLL 2003 evaluation data which is typically used to develop NER models.
Stanford NER is also known as CRFClassifier. The software provides a general implementation of (arbitrary order) linear chain Conditional Random Field (CRF) sequence models. That is, by training your own models on labeled data, you can actually use this code to build sequence models for NER or any other task.
CRF is amongst the most prominent approach used for NER. A linear chain CRF confers to a labeler in which tag assignment(for present word, denoted as yᵢ) depends only on the tag of just one previous word(denoted by yᵢ₋₁).
Install NLTKIn a new file, import NLTK and add the file paths for the Stanford NER jar file and the model from above. I also imported the StanfordNERTagger , which is the Python wrapper class in NLTK for the Stanford NER tagger. Next, initialize the tagger with the jar file path and the model file path.
I know it is an old thread but hoping it will help someone. As christopher manning has replied, the way to get lowercase detected is to replace english.muc.7class.distsim.crf.ser.gz with english.muc.7class.caseless.distsim.crf.ser.gz that you can get when you unzip the core nlp caseless jar file.
For example, in my python file I have kept everything same except changing to the new file and it works perfectly (well, most of the time)
st = NERTagger('/Users/username/stanford-corenlp-python/stanford-ner-2014-10-26/classifiers/english.muc.7class.caseless.distsim.crf.ser.gz', '/Users/username/stanford-corenlp-python/stanford-ner-2014-10-26/stanford-ner.jar')
I'm afraid there isn't an easy way to get the trained models we distribute to ignore case information at runtime. So, yes, they'll usually only label capitalized names. It would be possible to train a caseless model, which would work reasonably (but not as well on cased text, since case is a big clue in English (but not in German, Chinese, Arabic, etc.).
Along with other people's suggestions. If you're using a feature-based classifier, I would definitely add in the 100-200 most common 3-4 letter substrings in people's names or making a gazzeteer under one recognized feature. There are certain patterns that are bound to show up quite a bit in personal names that don't show up very often in other types of words, like "eli."
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With