
Named Entity Recognition Data and Features

I am building a Named Entity Recognizer with a Conditional Random Field and am looking for two things:

A) An open source, English NER dataset for Person, Location, and Organization entities

B) A list of English NER features

I have already looked at the CoNLL-2003 corpus and found that it is exactly what I want, but it is not readily available. I have also been unable to find a list of NER features, and I would like to avoid designing them by hand.

Thanks

Louise asked Oct 21 '22 16:10



2 Answers

You'll find a concise and very informative study of what NER requires in the paper "Design Challenges and Misconceptions in Named Entity Recognition" by Ratinov & Roth. In addition, their system is completely open source and includes lists of named entities gathered from Wikipedia.

eldams answered Nov 08 '22 08:11



A) Besides the MUC corpora, you should check out MASC, the Manually Annotated Sub-Corpus of the American National Corpus: http://www.americannationalcorpus.org/MASC/About.html It's free and covers various document genres. It comes with tools for parsing the format in NLTK, GATE and UIMA: http://www.anc.org/MASC/Download

B) This is a very general question. You can try n-grams, word capitalization, the word strings themselves, parts of speech, etc. as features. A good starting point is to read about the Stanford NER approach with a CRF: http://nlp.stanford.edu/software/CRF-NER.shtml
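To make the feature types above a bit more concrete, here is a minimal sketch of a per-token feature extractor in plain Python. The function names (`token_features`, `sentence_features`), the feature keys, and the one-token context window are all illustrative assumptions, not the feature set of Stanford NER or any other system. It assumes you already have tokens and POS tags for each sentence.

```python
from typing import Dict, List


def token_features(tokens: List[str], pos_tags: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for the token at position i (hypothetical feature set)."""
    word = tokens[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "word.lower": word.lower(),      # the word string itself
        "word.istitle": word.istitle(),  # capitalization: "Stanford"
        "word.isupper": word.isupper(),  # capitalization: "NASA"
        "word.isdigit": word.isdigit(),
        "prefix3": word[:3],             # character n-grams
        "suffix3": word[-3:],
        "pos": pos_tags[i],              # part of speech
    }
    if i > 0:
        prev = tokens[i - 1]
        feats["prev.word.lower"] = prev.lower()
        feats["prev.pos"] = pos_tags[i - 1]
        feats["bigram.prev"] = prev.lower() + "_" + word.lower()  # word bigram
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        nxt = tokens[i + 1]
        feats["next.word.lower"] = nxt.lower()
        feats["next.pos"] = pos_tags[i + 1]
        feats["bigram.next"] = word.lower() + "_" + nxt.lower()
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(tokens: List[str], pos_tags: List[str]) -> List[Dict[str, object]]:
    """One feature dict per token, in sentence order."""
    return [token_features(tokens, pos_tags, i) for i in range(len(tokens))]


if __name__ == "__main__":
    tokens = ["Alice", "works", "at", "Stanford"]
    pos_tags = ["NNP", "VBZ", "IN", "NNP"]
    for tok, feats in zip(tokens, sentence_features(tokens, pos_tags)):
        print(tok, feats)
```

A list of feature dicts per sentence is a shape that some CRF toolkits (for example sklearn-crfsuite) accept as input, but check the input format of whichever library you use. Gazetteer lookups and wider context windows are common next additions.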

Yasen answered Nov 08 '22 07:11
