I'm processing a large amount of documents using Stanford's CoreNLP library alongside the Stanford CoreNLP Python Wrapper. I'm using the following annotators:
tokenize, ssplit, pos, lemma, ner, entitymentions, parse, dcoref
along with the shift-reduce parser model englishSR.ser.gz. I'm mainly using CoreNLP for its coreference resolution / named entity recognition, and as far as I'm aware I'm using the minimal set of annotators for this purpose.
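For reference, the pipeline is configured roughly like this (a sketch of the standard CoreNLP properties I pass through the wrapper; the exact way the wrapper accepts them may differ):

```python
# Rough sketch of the CoreNLP properties in use (wrapper-specific details omitted);
# parse.model points the parse annotator at the shift-reduce parser model.
props = {
    "annotators": "tokenize,ssplit,pos,lemma,ner,entitymentions,parse,dcoref",
    "parse.model": "edu/stanford/nlp/models/srparser/englishSR.ser.gz",
    "outputFormat": "json",
}
```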
What can I do to speed up the annotation of these documents?
The other SO answers all suggest not reloading the models for every document, but I'm already following that advice (the wrapper starts the server once and then passes documents and results back and forth).
The documents I am processing have an average length of 20 sentences, with some as long as 400 sentences and some as short as 1. The average parse time per sentence is 1 second. I can parse ~2500 documents per day with one single-threaded process running on one machine, but I'd like to double that (if not more).
Try setting up the Stanford CoreNLP server rather than loading the annotators on each run. That way you load the annotators once and process documents a lot faster. The first request will be slower while the models load, but the rest are much faster. See the Stanford CoreNLP server documentation for more details.
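For example, here is a minimal sketch of that pattern, assuming a server is already running on localhost:9000 (started separately with edu.stanford.nlp.pipeline.StanfordCoreNLPServer). Each document is POSTed to the server, and the server keeps the loaded pipeline in memory across requests:

```python
import json
import requests

# Assumes a CoreNLP server is already running on this host/port.
CORENLP_URL = "http://localhost:9000/"

# The same annotators/model as in the question; the server caches the pipeline
# for a given property set, so the models are loaded only on the first request.
PROPS = {
    "annotators": "tokenize,ssplit,pos,lemma,ner,entitymentions,parse,dcoref",
    "parse.model": "edu/stanford/nlp/models/srparser/englishSR.ser.gz",
    "outputFormat": "json",
}

def annotate(text):
    """Send one document to the server and return the JSON annotation."""
    resp = requests.post(
        CORENLP_URL,
        params={"properties": json.dumps(PROPS)},
        data=text.encode("utf-8"),
    )
    resp.raise_for_status()
    return resp.json()

# Usage sketch: 'documents' is any iterable of raw text strings.
# for doc in documents:
#     ann = annotate(doc)
```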
Having said that, there is often a tradeoff between accuracy and speed, so you may want to do due diligence with other tools like NLTK and spaCy to see what works best for you.