I am building a Named Entity Recognizer with a Conditional Random Field and am looking for two things:
A) An open source, English NER dataset for Person, Location, and Organization entities
B) A list of English NER features
I have already looked at the CoNLL-2003 corpus, which is exactly what I want, but it is not readily available. I have also been unable to find a list of NER features, and I would like to avoid hand-designing them.
Thanks
You'll find a concise and very informative study of what is needed for NER in this paper by Ratinov & Roth. In addition, their system is completely open source and includes lists of named entities gathered from Wikipedia.
A) Besides the MUC corpora, you should check out the manually annotated sub-corpus (MASC) here: http://www.americannationalcorpus.org/MASC/About.html It's free and covers various document genres. It comes with tools for parsing the format in NLTK, GATE, and UIMA: http://www.anc.org/MASC/Download
B) This is a very general question. You can try n-grams, word capitalization, the word strings themselves, parts of speech, etc. A good starting point is to read about the Stanford NER approach, which uses a CRF: http://nlp.stanford.edu/software/CRF-NER.shtml
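To make the feature suggestions above a bit more concrete, here is a minimal sketch of per-token feature extraction in plain Python. It is only an illustration, not the feature set of any particular system: the function names (`word_shape`, `token_features`, `sentence_features`) and the feature keys are made up for this example, and it assumes you already have tokens and part-of-speech tags from some tagger. Dictionaries of this kind are a common input format for CRF toolkits, but check the documentation of the one you use.

```python
def word_shape(token):
    """Collapse a token into a coarse shape, e.g. 'Obama' -> 'Xx', 'U.S.' -> 'X.X.'."""
    shape = []
    for ch in token:
        if ch.isupper():
            code = "X"
        elif ch.islower():
            code = "x"
        elif ch.isdigit():
            code = "d"
        else:
            code = ch  # keep punctuation as-is
        # Merge runs of the same character class into one symbol.
        if not shape or shape[-1] != code:
            shape.append(code)
    return "".join(shape)


def token_features(tokens, pos_tags, i):
    """Build a feature dict for the token at position i (hypothetical feature names)."""
    token = tokens[i]
    feats = {
        "bias": 1.0,
        "word.lower": token.lower(),        # the word string itself
        "word.istitle": token.istitle(),    # capitalization cues
        "word.isupper": token.isupper(),
        "word.isdigit": token.isdigit(),
        "word.shape": word_shape(token),
        "prefix3": token[:3],               # character n-grams at the edges
        "suffix3": token[-3:],
        "pos": pos_tags[i],                 # part of speech (assumed given)
    }
    # Context features: previous and next token, i.e. a crude word n-gram window.
    if i > 0:
        feats["prev.word.lower"] = tokens[i - 1].lower()
        feats["prev.pos"] = pos_tags[i - 1]
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["next.word.lower"] = tokens[i + 1].lower()
        feats["next.pos"] = pos_tags[i + 1]
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(tokens, pos_tags):
    """Feature dicts for every token in one sentence."""
    return [token_features(tokens, pos_tags, i) for i in range(len(tokens))]


if __name__ == "__main__":
    tokens = ["Barack", "Obama", "visited", "Paris", "."]
    pos_tags = ["NNP", "NNP", "VBD", "NNP", "."]  # hand-written tags for the example
    for feats in sentence_features(tokens, pos_tags):
        print(feats)
```

Gazetteer lookups (for example, against the Wikipedia-derived entity lists mentioned above) could be added as extra boolean keys in the same dict.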