
Extract list of Persons and Organizations using Stanford NER Tagger in NLTK

I am trying to extract a list of persons and organizations using the Stanford Named Entity Recognizer (NER) in Python NLTK. When I run:

    from nltk.tag.stanford import NERTagger

    st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
                   '/usr/share/stanford-ner/stanford-ner.jar')
    r = st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
    print(r)

the output is:

[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')] 

What I want is to extract all persons and organizations from this list in this form:

    Rami Eid
    Stony Brook University

I tried to loop through the list of tuples:

    for x, y in r:
        if y == 'ORGANIZATION':
            print(x)

But this code only prints each word of the entity on its own line:

    Stony
    Brook
    University

With real data there can be more than one organization or person in a sentence. How can I mark the boundaries between different entities?

user1680859 asked Jun 05 '15 10:06




1 Answer

Thanks to the link discovered by @Vaulstein, it is clear that the trained Stanford tagger, as distributed (at least in 2012), does not chunk named entities. From the accepted answer:

Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicates where a person entity starts. The CRFClassifier class and feature factories support such labels, but they're not used in the models we currently distribute (as of 2012)
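To make the difference concrete, here is a minimal sketch that converts flat tags of the kind shown in the question into IOB-style labels. The function name `to_iob` is made up for this example, and it assumes that every run of identical tags is a single entity, which is precisely the information the flat tags cannot confirm.

```python
def to_iob(tagged):
    """Convert flat (word, tag) pairs into IOB-style (word, label) pairs.

    Assumption: each run of identical non-'O' tags is one entity.
    """
    result = []
    previous = "O"
    for word, tag in tagged:
        if tag == "O":
            label = "O"
        elif tag == previous:
            label = "I-" + tag   # continues the current entity
        else:
            label = "B-" + tag   # starts a new entity
        result.append((word, label))
        previous = tag
    return result


# The tagger output from the question, copied by hand.
flat = [("Rami", "PERSON"), ("Eid", "PERSON"), ("is", "O"),
        ("studying", "O"), ("at", "O"), ("Stony", "ORGANIZATION"),
        ("Brook", "ORGANIZATION"), ("University", "ORGANIZATION"),
        ("in", "O"), ("NY", "LOCATION")]
iob = to_iob(flat)
```

A model that really emitted IOB labels could mark two adjacent people with two separate B- labels; a conversion like this one never can, because it only sees where the tag changes.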

You have the following options:

  1. Collect runs of identically tagged words; e.g., all adjacent words tagged PERSON should be taken together as one named entity. That's very easy, but of course it will sometimes combine different named entities. (E.g. New York, Boston [and] Baltimore is about three cities, not one.) Edit: This is what Alvas's code does in the accepted answer. See below for a simpler implementation.

  2. Use nltk.ne_chunk(). It doesn't use the Stanford recognizer but it does chunk entities. (It's a wrapper around an IOB named entity tagger).

  3. Figure out a way to do your own chunking on top of the results that the Stanford tagger returns.

  4. Train your own IOB named entity chunker (using the Stanford tools, or the NLTK's framework) for the domain you are interested in. If you have the time and resources to do this right, it will probably give you the best results.
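As a rough illustration of option 3, the sketch below groups adjacent words with the same tag but also ends an entity whenever a word carries trailing punctuation, as can happen when the text was tokenized with `str.split()`. The function name `chunk_entities`, the `break_chars` parameter, and the sample tags are all hypothetical; the tags are written by hand, not produced by the Stanford tagger, and the heuristic is only one of many you could try.

```python
def chunk_entities(tagged, break_chars=",;"):
    """Group adjacent same-tag words, splitting runs at trailing punctuation.

    Heuristic only: assumes punctuation is still attached to the words.
    """
    entities = []              # finished (tag, text) pairs
    words, current = [], "O"   # words of the entity being collected
    for word, tag in tagged:
        # A change of tag always closes the entity collected so far.
        if tag != current and words:
            entities.append((current, " ".join(words)))
            words = []
        current = tag
        if tag == "O":
            continue
        stripped = word.rstrip(break_chars)
        if stripped:
            words.append(stripped)
        # Trailing punctuation closes the entity even if the next tag matches.
        if stripped != word and words:
            entities.append((tag, " ".join(words)))
            words = []
    if words:
        entities.append((current, " ".join(words)))
    return entities


# Hand-written tags for illustration, not real tagger output.
sample = [("New", "LOCATION"), ("York,", "LOCATION"), ("Boston", "LOCATION"),
          ("and", "O"), ("Baltimore", "LOCATION")]
chunks = chunk_entities(sample)
```

On this sample the comma separates New York from Boston, and the untagged word "and" separates Boston from Baltimore. Two entities of the same type with nothing at all between them would still be merged.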

Edit: If all you want is to pull out runs of continuous named entities (option 1 above), you should use itertools.groupby:

    from itertools import groupby

    for tag, chunk in groupby(netagged_words, lambda x: x[1]):
        if tag != "O":
            print("%-12s" % tag, " ".join(w for w, t in chunk))

If netagged_words is the list of (word, type) tuples in your question, this produces:

    PERSON       Rami Eid
    ORGANIZATION Stony Brook University
    LOCATION     NY

Note again that if two named entities of the same type occur right next to each other, this approach will combine them. E.g. New York, Boston [and] Baltimore is about three cities, not one.
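To get the output requested in the question (only persons and organizations, one entity per line), the same `groupby` idea can be wrapped in a small helper. This is a sketch, not part of NLTK: the name `extract_entities` and its `wanted` parameter are invented here, and the caveat about adjacent entities of the same type still applies.

```python
from itertools import groupby


def extract_entities(tagged, wanted=("PERSON", "ORGANIZATION")):
    """Return (tag, text) pairs for each run of words whose tag is in `wanted`."""
    entities = []
    for tag, chunk in groupby(tagged, key=lambda pair: pair[1]):
        if tag in wanted:
            entities.append((tag, " ".join(word for word, _ in chunk)))
    return entities


# The tagger output from the question, copied by hand.
netagged_words = [("Rami", "PERSON"), ("Eid", "PERSON"), ("is", "O"),
                  ("studying", "O"), ("at", "O"), ("Stony", "ORGANIZATION"),
                  ("Brook", "ORGANIZATION"), ("University", "ORGANIZATION"),
                  ("in", "O"), ("NY", "LOCATION")]

for tag, text in extract_entities(netagged_words):
    print(text)
```

For the sentence in the question this prints Rami Eid and Stony Brook University on separate lines. Returning the tag together with the text keeps it easy to sort the results into separate lists of persons and organizations afterwards.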

alexis answered Sep 24 '22 02:09
