I am looking for a free tagged corpus to train a Named Entity Recognition system on. Most of the ones I find (like the New York Times one) are expensive and not open. Can anyone help?
There's a list of corpora at http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html
The CoNLL 2003 corpus, which is on that list, is free and is available from http://www.cnts.ua.ac.be/conll2003/ner/ (annotations) and NIST (text).
Python's NLTK also ships the nltk.corpus.conll2000 corpus. Calling conll2000.iob_words() returns a list of (word, part-of-speech, IOB) triples, where IOB is a tag in the Inside/Outside/Beginning format. Note that CoNLL-2000 is a chunking corpus, so its IOB tags mark phrase chunks (for example B-NP, I-NP) rather than named entities.
It contains about 250k words of newswire-style text.
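As a rough sketch of working with those triples: the code below loads the corpus through NLTK and groups consecutive B-/I- tags into labelled spans. The helper names group_chunks and load_conll2000 are made up for this example and are not part of NLTK. It assumes NLTK is installed and the corpus can be downloaded. The labels in this corpus are chunk types such as NP and VP.

```python
def group_chunks(triples):
    """Group (word, pos, iob) triples into (label, [words]) spans.

    Hypothetical helper for illustration; not an NLTK function.
    """
    chunks = []
    current_label, current_words = None, []
    for word, _pos, iob in triples:
        # "B-NP" -> ("B", "NP"); a bare "O" -> ("O", "")
        prefix, _, label = iob.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and label != current_label)
        if prefix == "O" or starts_new:
            # Close the span collected so far, if any.
            if current_words:
                chunks.append((current_label, current_words))
            current_label, current_words = None, []
        if prefix in ("B", "I"):
            current_label = label
            current_words.append(word)
    if current_words:
        chunks.append((current_label, current_words))
    return chunks


def load_conll2000():
    """Download the corpus if needed and return its IOB triples."""
    import nltk
    from nltk.corpus import conll2000

    nltk.download("conll2000", quiet=True)
    return conll2000.iob_words()


if __name__ == "__main__":
    triples = load_conll2000()
    print(len(triples), "tokens")
    for label, words in group_chunks(triples[:20]):
        print(label, " ".join(words))
```

The same grouping logic should carry over to an NER corpus that uses IOB tags, with labels like PER or LOC in place of NP and VP.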