 

What can I do to speed up Stanford CoreNLP (dcoref/ner)?

I'm processing a large number of documents using Stanford's CoreNLP library through the Stanford CoreNLP Python Wrapper. I'm using the following annotators:

tokenize, ssplit, pos, lemma, ner, entitymentions, parse, dcoref

along with the shift-reduce parser model englishSR.ser.gz. I'm mainly using CoreNLP for its co-reference resolution / named entity recognition, and as far as I'm aware I'm using the minimal set of annotators for this purpose.
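
For reference, this is roughly the properties configuration the pipeline corresponds to, written as the kind of dict a client might pass per request (the key names follow the CoreNLP documentation, but my wrapper's exact interface may differ):

```python
# Roughly the CoreNLP properties my pipeline corresponds to (a sketch; the
# exact way my wrapper passes these may differ).
props = {
    "annotators": "tokenize,ssplit,pos,lemma,ner,entitymentions,parse,dcoref",
    # shift-reduce constituency parser instead of the default PCFG model
    "parse.model": "edu/stanford/nlp/models/srparser/englishSR.ser.gz",
    "outputFormat": "json",
}
```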

What steps can I take to speed up the annotation of these documents?

The other SO answers all suggest not loading the models for every document, but I'm already avoiding that: the wrapper starts the server once and then passes documents/results back and forth.
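
The pattern looks roughly like this; I'm using the pycorenlp client here purely as an illustration, my actual wrapper differs:

```python
# Sketch of the "load models once, annotate many documents" pattern, using the
# pycorenlp client as a stand-in for my wrapper, against a server on localhost:9000.
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP("http://localhost:9000")  # connect once; the server keeps the models in memory
props = {"annotators": "tokenize,ssplit,pos,lemma,ner,entitymentions,parse,dcoref",
         "outputFormat": "json"}                # see the fuller properties sketch above

for text in ["First document ...", "Second document ..."]:  # placeholder documents
    result = nlp.annotate(text, properties=props)
    corefs = result["corefs"]                   # coreference chains for this document
```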

The documents I am processing have an average length of 20 sentences, with some as long as 400 sentences and some as short as 1. The average parse time per sentence is 1 second. I can parse ~2500 documents per day with one single-threaded process running on one machine, but I'd like to double that (if not more).

Ayrton Massey asked Jul 22 '15


1 Answer

Try setting up the Stanford CoreNLP server rather than loading the annotators on every run. That way you load the annotators once and the documents are processed much faster. The first request will be slower while the models load, but subsequent requests are much faster. See the Stanford CoreNLP server documentation for more details.
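
If it helps, the server is usually started with something along these lines (the memory, port and timeout values here are only illustrative, and this assumes the CoreNLP jars are in the current directory):

```
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
```

Once it is up, clients just POST text to it, and the loaded models are shared across all requests.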

That said, there is often a tradeoff between accuracy and speed, so you may want to do due diligence with other tools like NLTK and spaCy to see what works best for you.
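
For a quick speed comparison on the NER side, a minimal spaCy sketch might look like this (assuming spaCy and its small English model are installed; note that spaCy does not ship coreference out of the box, so this only covers the NER part):

```python
# Minimal spaCy NER sketch for a speed comparison (assumes:
#   pip install spacy && python -m spacy download en_core_web_sm)
import spacy

nlp = spacy.load("en_core_web_sm")   # load the model once, reuse for all documents
doc = nlp("Barack Obama visited Stanford University in July.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```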

Manas Ranjan Kar answered Nov 07 '22