First I tokenize the file content into sentences and then call Stanford NER on each sentence. But this process is really slow. I know that calling it on the whole file content at once would be faster, but I'm calling it on each sentence because I want to index each sentence before and after NE recognition.
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag.stanford import NERTagger  # called StanfordNERTagger in newer NLTK versions

st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')

for filename in filelist:
    with open(filename) as f:
        filecontent = f.read()
    sentences = sent_tokenize(filecontent)  # break file content into sentences
    for j, sent in enumerate(sentences):
        words = word_tokenize(sent)  # tokenize the sentence into words
        ne_tags = st.tag(words)  # get tagged NEs from Stanford NER
The slowness is probably due to calling st.tag() once per sentence, but is there any way to make it run faster?
EDIT
The reason I want to tag sentences separately is that I want to write the sentences to a file (like sentence indexing), so that given the NE-tagged sentence at a later stage I can get back the unprocessed sentence (I'm also doing lemmatization here).
file format:
(sent_number, orig_sentence, NE_and_lemmatized_sentence)
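For illustration, a minimal sketch of writing that index, assuming a tab-separated sentence_index.tsv file and a simple word/TAG text form for the third column (both are placeholders, not part of my actual setup), reusing st and sentences from the code above:

import csv
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

with open('sentence_index.tsv', 'w') as out:
    writer = csv.writer(out, delimiter='\t')
    for j, sent in enumerate(sentences):
        # NE-tag and lemmatize the sentence, e.g. "alice/PERSON go/O ..."
        tagged = st.tag(word_tokenize(sent))
        processed = ' '.join('%s/%s' % (lemmatizer.lemmatize(w.lower()), t) for w, t in tagged)
        writer.writerow((j, sent, processed))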
You can use the Stanford NER server; it will be much faster.
Install sner:
pip install sner
Run the NER server:
cd your_stanford_ner_dir
java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz
from sner import Ner
test_string = "Alice went to the Museum of Natural History."
tagger = Ner(host='localhost',port=9199)
print(tagger.get_entities(test_string))
The result of this code is:
[('Alice', 'PERSON'),
('went', 'O'),
('to', 'O'),
('the', 'O'),
('Museum', 'ORGANIZATION'),
('of', 'ORGANIZATION'),
('Natural', 'ORGANIZATION'),
('History', 'ORGANIZATION'),
('.', 'O')]
For more detail, see https://github.com/caihaoyu/sner
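To plug this into the per-sentence loop from the question, here is a rough sketch (assuming the server started above is listening on port 9199). Each get_entities() call is only a socket round-trip, with no JVM start-up per sentence, so tagging sentence by sentence stays fast:

from nltk.tokenize import sent_tokenize
from sner import Ner

tagger = Ner(host='localhost', port=9199)

for filename in filelist:
    with open(filename) as f:
        filecontent = f.read()
    for j, sent in enumerate(sent_tokenize(filecontent)):
        ne_tags = tagger.get_entities(sent)  # list of (word, tag) pairs for this sentence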
StanfordNERTagger has a tag_sents() function; see https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L68
>>> st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
>>> tokenized_sents = [word_tokenize(sent) for filename in filelist for sent in sent_tokenize(open(filename).read())]
>>> st.tag_sents(tokenized_sents)
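tag_sents() returns one tagged token list per input sentence, in the same order, so the sentence indexing from the question can still be recovered. A rough sketch, assuming a flat orig_sents list built in the same pass as tokenized_sents:

from nltk.tokenize import sent_tokenize, word_tokenize

# Build two parallel flat lists: the original sentences and their token lists.
orig_sents = [sent for filename in filelist
                   for sent in sent_tokenize(open(filename).read())]
tokenized_sents = [word_tokenize(sent) for sent in orig_sents]

tagged_sents = st.tag_sents(tokenized_sents)  # a single batched call to the Java process

for j, (sent, tags) in enumerate(zip(orig_sents, tagged_sents)):
    print(j, sent, tags)  # matches the (sent_number, orig_sentence, tagged_sentence) layout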