Named Entity Recognition Data and Features

Question

I am building a Named Entity Recognizer with a Conditional Random Field and am looking for two things:

A) An open source, English NER dataset for Person, Location, and Organization entities

B) A list of English NER features

I have already looked at the CoNLL-2003 corpus, which is exactly what I want, but it is not readily available. I have also been unable to find a list of NER features, and I would like to avoid designing them by hand.

Thanks

eldams · Accepted Answer

You'll find a concise and very informative study of what is needed for NER in this paper by Ratinov & Roth ("Design Challenges and Misconceptions in Named Entity Recognition", CoNLL 2009). In addition, their system is completely open-source, and includes lists of named entities gathered from Wikipedia.

Yasen · Answer

A) Besides the MUC corpora, you should check out the manually annotated sub-corpus (MASC) here: http://www.americannationalcorpus.org/MASC/About.html. It's free and covers various document genres. It comes with tools for parsing the format in NLTK, GATE and UIMA: http://www.anc.org/MASC/Download

B) This is a very general question. You can try n-grams, word capitalization, the word strings themselves, parts of speech, etc. A good starting point is to read about the Stanford NER approach, which uses a CRF: http://nlp.stanford.edu/software/CRF-NER.shtml
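To make the feature list in B) a bit more concrete, here is a minimal sketch of a per-token feature extractor in plain Python. The function names (`word_features`, `sentence_features`) and the feature keys are my own illustrative choices, not taken from the Stanford system or any particular toolkit, and the POS tags are assumed to come from whatever tagger you already use.

```python
def word_features(tokens, pos_tags, i):
    """Build a feature dict for the token at position i (illustrative sketch)."""
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),      # the word string itself
        "word.istitle": word.istitle(),  # capitalization: "Paris"
        "word.isupper": word.isupper(),  # all caps: "NATO"
        "word.isdigit": word.isdigit(),
        "prefix3": word[:3],             # character n-grams at the word edges
        "suffix3": word[-3:],
        "pos": pos_tags[i],              # part of speech (assumed precomputed)
    }
    # Context features from the previous token, or a beginning-of-sentence flag.
    if i > 0:
        feats["prev.lower"] = tokens[i - 1].lower()
        feats["prev.pos"] = pos_tags[i - 1]
    else:
        feats["BOS"] = True
    # Context features from the next token, or an end-of-sentence flag.
    if i < len(tokens) - 1:
        feats["next.lower"] = tokens[i + 1].lower()
        feats["next.pos"] = pos_tags[i + 1]
    else:
        feats["EOS"] = True
    return feats


def sentence_features(tokens, pos_tags):
    """Return one feature dict per token in the sentence."""
    return [word_features(tokens, pos_tags, i) for i in range(len(tokens))]


if __name__ == "__main__":
    tokens = ["John", "lives", "in", "Paris"]
    pos_tags = ["NNP", "VBZ", "IN", "NNP"]
    for token, feats in zip(tokens, sentence_features(tokens, pos_tags)):
        print(token, feats)
```

A list of such dicts per sentence is the input format that CRF wrappers like sklearn-crfsuite accept, so this should plug into a CRF trainer with little glue code. Gazetteer lookups (such as the Wikipedia-derived lists mentioned in the other answer) could be added as extra boolean keys in the same dict.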


Tags:

nlp

named-entity-recognition

Louise
