
Stanford Named Entity Recognizer (NER) functionality with NLTK

Is it possible to get functionality similar to the Stanford Named Entity Recognizer using just NLTK?

Is there any example?

In particular, I am interested in extracting the LOCATION part of a text. For example, from the text

The meeting will be held at 22 West Westin st., South Carolina, 12345 on Nov.-18

Ideally, I would like to get something like

(S  
22/LOCATION
(LOCATION West/LOCATION Westin/LOCATION)
st./LOCATION
,/,
(South/LOCATION Carolina/LOCATION)
,/,
12345/LOCATION

.....

or simply

22 West Westin st., South Carolina, 12345

Instead, I am only able to get

(S
  The/DT
  meeting/NN
  will/MD
  be/VB
  held/VBN
  at/IN
  22/CD
  (LOCATION West/NNP Westin/NNP)
  st./NNP
  ,/,
  (GPE South/NNP Carolina/NNP)
  ,/,
  12345/CD
  on/IN
  Nov.-18/-NONE-)

Note that if I enter my text into http://nlp.stanford.edu:8080/ner/process, the results are far from perfect (the street number and zip code are still missing), but at least "st." is part of the LOCATION, and South Carolina is tagged as a LOCATION rather than as "GPE / NNP".

What am I doing wrong? How can I use NLTK to extract the location from a piece of text?

Many thanks in advance!

asked Aug 22 '13 03:08 by bzdjamboo



1 Answer

NLTK does have an interface for Stanford NER: see nltk.tag.stanford.NERTagger (renamed StanfordNERTagger in later NLTK releases).

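from nltk.tag.stanford import NERTagger
# First argument: path to the trained classifier; second: path to the Stanford NER jar.
st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
               '/usr/share/stanford-ner/stanford-ner.jar')
# tag() expects a list of tokens, so the sentence is split on whitespace.
st.tag('Rami Eid is studying at Stony Brook University in NY'.split())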

output:

[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
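The tagger labels tokens one at a time, so a multi-word entity such as "Stony Brook University" arrives as three separate pairs. A minimal sketch of merging consecutive tokens that share a label, assuming the (token, label) format shown above; group_entities is a hypothetical helper, not part of NLTK, and it will also merge two distinct entities that happen to be adjacent and share a label:

```python
from itertools import groupby


def group_entities(tagged, outside="O"):
    """Merge consecutive (token, label) pairs with the same label.

    Hypothetical helper for illustration; tokens labelled `outside`
    (Stanford NER uses 'O') are dropped.
    """
    entities = []
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        if label != outside:
            entities.append((label, " ".join(token for token, _ in run)))
    return entities


# Output copied from the example above; Stanford NER is not invoked here.
tagged = [
    ("Rami", "PERSON"), ("Eid", "PERSON"), ("is", "O"), ("studying", "O"),
    ("at", "O"), ("Stony", "ORGANIZATION"), ("Brook", "ORGANIZATION"),
    ("University", "ORGANIZATION"), ("in", "O"), ("NY", "LOCATION"),
]

entities = group_entities(tagged)
locations = [text for label, text in entities if label == "LOCATION"]
print(entities)
print(locations)
```

Filtering the grouped result on the LOCATION label, as in the last lines, gives the location strings the question asks about.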

However, every time you call tag, NLTK writes the target sentence to a file, runs the Stanford NER command-line tool on that file, and parses the output back into Python. The overhead of loading the classifier (around 1 minute for me) is therefore paid on every call.

If that's a problem, use Pyner.

First, run Stanford NER as a server:

java -mx1000m -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer \
-loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -port 9191

Then go to the pyner folder and run:

import ner
tagger = ner.SocketNER(host='localhost', port=9191)
tagger.get_entities("University of California is located in California, United States")
# {'LOCATION': ['California', 'United States'],
# 'ORGANIZATION': ['University of California']}
tagger.json_entities("Alice went to the Museum of Natural History.")
#'{"ORGANIZATION": ["Museum of Natural History"], "PERSON": ["Alice"]}'
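Since the question is specifically about locations, here is a small sketch of pulling the LOCATION entries out of either return format, assuming the dict and JSON shapes shown in the comments above. The helper locations_from_entities is hypothetical, and the sample values are copied from those comments so no server is contacted:

```python
import json


def locations_from_entities(entities):
    """Return the LOCATION strings from a {label: [text, ...]} mapping.

    Hypothetical helper; returns an empty list when no LOCATION was found.
    """
    return list(entities.get("LOCATION", []))


# Shape of the get_entities() result shown above.
entities = {
    "LOCATION": ["California", "United States"],
    "ORGANIZATION": ["University of California"],
}

# Shape of the json_entities() result shown above: a JSON string,
# so it has to be decoded before it can be queried.
as_json = '{"ORGANIZATION": ["Museum of Natural History"], "PERSON": ["Alice"]}'

print(locations_from_entities(entities))
print(locations_from_entities(json.loads(as_json)))
```

Using .get with a default avoids a KeyError for texts in which the tagger finds no location at all.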

Hope this helps.

answered Oct 16 '22 09:10 by junjiah
