Extract list of Persons and Organizations using Stanford NER Tagger in NLTK

Tags: python, nltk, stanford-nlp, named-entity-recognition

I am trying to extract a list of persons and organizations using the Stanford Named Entity Recognizer (NER) in Python NLTK. When I run:

    from nltk.tag.stanford import NERTagger

    st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
                   '/usr/share/stanford-ner/stanford-ner.jar')
    r = st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
    print(r)

the output is:

[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

What I want is to extract all persons and organizations from this list in this form:

    Rami Eid
    Stony Brook University

I tried to loop through the list of tuples:

    for x, y in r:
        if y == 'ORGANIZATION':
            print(x)

But this code prints each word of an entity on its own line:

    Stony
    Brook
    University

With real data there can be more than one organization or person in a sentence. How can I mark the boundaries between different entities?

533

asked Jun 05 '15 10:06

user1680859

1 Answer

Thanks to the link discovered by @Vaulstein, it is clear that the trained Stanford tagger, as distributed (at least in 2012), does not chunk named entities. From the accepted answer:

Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicates where a person entity starts. The CRFClassifier class and feature factories support such labels, but they're not used in the models we currently distribute (as of 2012)

You have the following options:

1. Collect runs of identically tagged words; e.g., all adjacent words tagged PERSON should be taken together as one named entity. That's very easy, but of course it will sometimes combine different named entities. (E.g. New York, Boston [and] Baltimore is about three cities, not one.) Edit: This is what Alvas's code does in the accepted answer. See below for a simpler implementation.
2. Use nltk.ne_chunk(). It doesn't use the Stanford recognizer, but it does chunk entities. (It's a wrapper around an IOB named entity tagger.)
3. Figure out a way to do your own chunking on top of the results that the Stanford tagger returns.
4. Train your own IOB named entity chunker (using the Stanford tools, or NLTK's framework) for the domain you are interested in. If you have the time and resources to do this right, it will probably give you the best results.
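
As a possible starting point for option 3, the flat tags can be rewritten as IOB-style tags, which most chunking tools expect. The sketch below is only an illustration: `to_iob` and `tagged` are hypothetical names, not part of NLTK or the Stanford tools, and the conversion makes the same assumption as option 1 (adjacent words with the same tag belong to one entity), so any smarter boundary logic would still have to be added on top.

```python
def to_iob(tagged_words, outside="O"):
    """Convert flat (word, tag) pairs into IOB-style (word, tag) pairs.

    The first word of a run of identical tags gets a "B-" prefix,
    the following words of that run get "I-"; outside words stay as-is.
    """
    iob = []
    previous = outside
    for word, tag in tagged_words:
        if tag == outside:
            iob.append((word, outside))
        elif tag == previous:
            iob.append((word, "I-" + tag))   # continues the current run
        else:
            iob.append((word, "B-" + tag))   # starts a new run
        previous = tag
    return iob


# The tagger output from the question (hypothetical variable name).
tagged = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'),
          ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
          ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'),
          ('in', 'O'), ('NY', 'LOCATION')]

iob_tagged = to_iob(tagged)
print(iob_tagged[:2])   # [('Rami', 'B-PERSON'), ('Eid', 'I-PERSON')]
```

A custom chunker could then revise the "B-" boundaries, for example by starting a new entity at words it knows to be separate names.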

Edit: If all you want is to pull out runs of contiguous named entities (option 1 above), you can use itertools.groupby:

    from itertools import groupby

    for tag, chunk in groupby(netagged_words, lambda x: x[1]):
        if tag != "O":
            print("%-12s" % tag, " ".join(w for w, t in chunk))

If netagged_words is the list of (word, type) tuples in your question, this produces:

    PERSON       Rami Eid
    ORGANIZATION Stony Brook University
    LOCATION     NY

Note again that if two named entities of the same type occur right next to each other, this approach will combine them. E.g. New York, Boston [and] Baltimore is about three cities, not one.
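
To get only the persons and organizations the question asks for, the same groupby idea can be wrapped in a small function. This is a minimal sketch under the assumptions already stated: `extract_entities`, `wanted` and `tagged` are hypothetical names chosen for illustration, and the last call shows the merging caveat rather than fixing it.

```python
from itertools import groupby


def extract_entities(tagged_words, wanted=("PERSON", "ORGANIZATION")):
    """Return (tag, phrase) pairs for runs of adjacent words sharing a wanted tag."""
    entities = []
    for tag, chunk in groupby(tagged_words, key=lambda pair: pair[1]):
        if tag in wanted:
            entities.append((tag, " ".join(word for word, _ in chunk)))
    return entities


# The tagger output from the question (hypothetical variable name).
tagged = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'),
          ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
          ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'),
          ('in', 'O'), ('NY', 'LOCATION')]

entities = extract_entities(tagged)
for tag, phrase in entities:
    print(phrase)
# Rami Eid
# Stony Brook University

# Caveat: adjacent entities of the same type are merged into one phrase.
merged = extract_entities(
    [('New', 'LOCATION'), ('York', 'LOCATION'), ('Boston', 'LOCATION')],
    wanted=("LOCATION",))
print(merged)   # [('LOCATION', 'New York Boston')]
```

Returning the pairs instead of printing them makes it easy to group the phrases by type afterwards, for example into one list of persons and one of organizations.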

140

answered Sep 24 '22 02:09

alexis
