I'm facing a concurrency problem when annotating multiple sentences simultaneously, and it's unclear to me whether I'm doing something wrong or there is a bug in CoreNLP.
My goal is to annotate sentences with the pipeline "tokenize, ssplit, pos, lemma, ner, parse, dcoref" using several threads running in parallel. Each thread allocates its own StanfordCoreNLP instance and then uses it for the annotation.
The problem is that at some point the following exception is thrown:
java.util.ConcurrentModificationException
at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
at java.util.ArrayList$Itr.next(ArrayList.java:851)
at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042)
at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:463)
at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
at edu.stanford.nlp.trees.GrammaticalStructure.<init>(GrammaticalStructure.java:201)
at edu.stanford.nlp.trees.EnglishGrammaticalStructure.<init>(EnglishGrammaticalStructure.java:89)
at edu.stanford.nlp.semgraph.SemanticGraphFactory.makeFromTree(SemanticGraphFactory.java:139)
at edu.stanford.nlp.pipeline.DeterministicCorefAnnotator.annotate(DeterministicCorefAnnotator.java:89)
at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:68)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:412)
I'm attaching sample code for an application that reproduces the problem in about 20 seconds on my Core i3 370M laptop (Win 7 64-bit, Java 1.8.0_45 64-bit). The app reads an XML file from the Recognizing Textual Entailment (RTE) corpora and then parses all of its sentences simultaneously using standard Java concurrency classes. The path to a local RTE XML file must be given as a command-line argument. In my tests I used the publicly available file here: http://www.nist.gov/tac/data/RTE/RTE3-DEV-FINAL.tar.gz
package semante.parser.stanford.server;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlAttribute;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
public class StanfordMultiThreadingTest {
    @XmlRootElement(name = "entailment-corpus")
    @XmlAccessorType(XmlAccessType.FIELD)
    public static class Corpus {
        @XmlElement(name = "pair")
        private List<Pair> pairList = new ArrayList<Pair>();

        public void addPair(Pair p) {pairList.add(p);}
        public List<Pair> getPairList() {return pairList;}
    }

    @XmlRootElement(name = "pair")
    public static class Pair {
        @XmlAttribute(name = "id")
        String id;
        @XmlAttribute(name = "entailment")
        String entailment;
        @XmlElement(name = "t")
        String t;
        @XmlElement(name = "h")
        String h;

        private Pair() {}

        public Pair(int id, boolean entailment, String t, String h) {
            this();
            this.id = Integer.toString(id);
            this.entailment = entailment ? "YES" : "NO";
            this.t = t;
            this.h = h;
        }

        public String getId() {return id;}
        public String getEntailment() {return entailment;}
        public String getT() {return t;}
        public String getH() {return h;}
    }
    class NullStream extends OutputStream {
        @Override
        public void write(int b) {}
    }

    private Corpus corpus;
    private Unmarshaller unmarshaller;
    private ExecutorService executor;

    public StanfordMultiThreadingTest() throws Exception {
        javax.xml.bind.JAXBContext jaxbCtx = JAXBContext.newInstance(Pair.class, Corpus.class);
        unmarshaller = jaxbCtx.createUnmarshaller();
        executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    }

    public void readXML(String fileName) throws Exception {
        System.out.println("Reading XML - Started");
        corpus = (Corpus) unmarshaller.unmarshal(new InputStreamReader(new FileInputStream(fileName), StandardCharsets.UTF_8));
        System.out.println("Reading XML - Ended");
    }
    public void parseSentences() throws Exception {
        System.out.println("Parsing - Started");

        // turn the pairs into a flat list of sentences
        List<String> sentences = new ArrayList<String>();
        for (Pair pair : corpus.getPairList()) {
            sentences.add(pair.getT());
            sentences.add(pair.getH());
        }

        // prepare the properties
        final Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");

        // first run is long since models are loaded
        new StanfordCoreNLP(props);

        // to avoid the CoreNLP initialization prints (e.g. "Adding annotation pos")
        final PrintStream nullPrintStream = new PrintStream(new NullStream());
        PrintStream err = System.err;
        System.setErr(nullPrintStream);

        int totalCount = sentences.size();
        AtomicInteger counter = new AtomicInteger(0);

        // use Java concurrency to parallelize the parsing
        for (String sentence : sentences) {
            executor.execute(new Runnable() {
                @Override
                public void run() {
                    try {
                        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
                        Annotation annotation = new Annotation(sentence);
                        pipeline.annotate(annotation);
                        if (counter.incrementAndGet() % 20 == 0) {
                            System.out.println("Done: " + String.format("%.2f", counter.get() * 100 / (double) totalCount));
                        }
                    } catch (Exception e) {
                        System.setErr(err);
                        e.printStackTrace();
                        System.setErr(nullPrintStream);
                        executor.shutdownNow();
                    }
                }
            });
        }

        executor.shutdown();
        System.out.println("Waiting for parsing to end.");
        executor.awaitTermination(10, TimeUnit.MINUTES);
        System.out.println("Parsing - Ended");
    }
    public static void main(String[] args) throws Exception {
        StanfordMultiThreadingTest smtt = new StanfordMultiThreadingTest();
        smtt.readXML(args[0]);
        smtt.parseSentences();
    }
}
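One note on the sample: each Runnable above constructs its own StanfordCoreNLP, so strictly speaking a pipeline is created per task rather than per thread. What I mean by "one instance per thread" is a pattern like the following fragment, which is purely illustrative (it reuses the props, sentences and executor variables from parseSentences() above and needs Java 8 for ThreadLocal.withInitial):

// Illustrative only: one pipeline per worker thread instead of one per task,
// reusing 'props', 'sentences' and 'executor' from parseSentences() above.
ThreadLocal<StanfordCoreNLP> pipelines =
        ThreadLocal.withInitial(() -> new StanfordCoreNLP(props));

for (String sentence : sentences) {
    executor.execute(() -> {
        StanfordCoreNLP pipeline = pipelines.get(); // constructed once per thread, then reused
        Annotation annotation = new Annotation(sentence);
        pipeline.annotate(annotation);
    });
}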
In my attempt to find some background information, I came across answers given by Christopher Manning and Gabor Angeli of Stanford indicating that contemporary versions of Stanford CoreNLP should be thread-safe. However, a recent bug report on CoreNLP version 3.4.1 describes a concurrency problem. As mentioned in the title, I'm using version 3.5.2.
It's unclear to me whether the problem I'm facing is due to a bug or due to something wrong in the way I use the package, and I'd appreciate it if someone more knowledgeable could shed some light on this. I hope the sample code is useful for reproducing the problem. Thanks!
Have you tried using the threads option? You can specify a number of threads for a single StanfordCoreNLP pipeline, and it will then process sentences in parallel.
For example, if you want to process sentences on 8 cores, set the threads option to 8:
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
props.put("threads", "8");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Nevertheless, I think your solution should also work, and we'll check whether there is some concurrency bug; in the meantime, using this option might solve your problem.
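Roughly, such a setup could look like the minimal sketch below; the class name and the example text are just placeholders, the point is that one shared pipeline is built with the threads property and the sentence-level work is then parallelized inside CoreNLP rather than by your own executor.

import java.util.Properties;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class ThreadsOptionSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        props.put("threads", "8"); // let CoreNLP parallelize internally

        // one shared pipeline for the whole run; the models are loaded once here
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // placeholder text with several sentences; ssplit splits it and the
        // sentences can then be processed on the configured number of threads
        String text = "The cat sat on the mat. A dog chased the cat. The cat ran up a tree.";
        Annotation annotation = new Annotation(text);
        pipeline.annotate(annotation);
    }
}

The design difference from your test is that there is exactly one StanfordCoreNLP object and no user-level executor; the pipeline itself decides how to spread the sentences over threads.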