
Concurrent processing using Stanford CoreNLP (3.5.2)

I'm facing a concurrency problem when annotating multiple sentences simultaneously. It's unclear to me whether I'm doing something wrong or whether there is a bug in CoreNLP.

My goal is to annotate sentences with the pipeline "tokenize, ssplit, pos, lemma, ner, parse, dcoref" using several threads running in parallel. Each thread allocates its own instance of StanfordCoreNLP and then uses it for the annotation.

The problem is that at some point an exception is thrown:

java.util.ConcurrentModificationException
	at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
	at java.util.ArrayList$Itr.next(ArrayList.java:851)
	at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:463)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.<init>(GrammaticalStructure.java:201)
	at edu.stanford.nlp.trees.EnglishGrammaticalStructure.<init>(EnglishGrammaticalStructure.java:89)
	at edu.stanford.nlp.semgraph.SemanticGraphFactory.makeFromTree(SemanticGraphFactory.java:139)
	at edu.stanford.nlp.pipeline.DeterministicCorefAnnotator.annotate(DeterministicCorefAnnotator.java:89)
	at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:68)
	at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:412)

I'm attaching sample code for an application that reproduces the problem in about 20 seconds on my Core i3 370M laptop (Win 7 64-bit, Java 1.8.0.45 64-bit). This app reads an XML file from the Recognizing Textual Entailment (RTE) corpora and then parses all sentences simultaneously using standard Java concurrency classes. The path to a local RTE XML file needs to be given as a command-line argument. In my tests I used the publicly available XML file here: http://www.nist.gov/tac/data/RTE/RTE3-DEV-FINAL.tar.gz

package semante.parser.stanford.server;

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlAttribute;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class StanfordMultiThreadingTest {

	@XmlRootElement(name = "entailment-corpus")
	@XmlAccessorType (XmlAccessType.FIELD)
	public static class Corpus {
		@XmlElement(name = "pair")
		private List<Pair> pairList = new ArrayList<Pair>();

		public void addPair(Pair p) {pairList.add(p);}
		public List<Pair> getPairList() {return pairList;}
	}

	@XmlRootElement(name="pair")
	public static class Pair {

		@XmlAttribute(name = "id")
		String id;

		@XmlAttribute(name = "entailment")
		String entailment;

		@XmlElement(name = "t")
		String t;

		@XmlElement(name = "h")
		String h;

		private Pair() {}

		public Pair(int id, boolean entailment, String t, String h) {
			this();
			this.id = Integer.toString(id);
			this.entailment = entailment ? "YES" : "NO";
			this.t = t;
			this.h = h;
		}

		public String getId() {return id;}
		public String getEntailment() {return entailment;}
		public String getT() {return t;}
		public String getH() {return h;}
	}
	
	class NullStream extends OutputStream {
		@Override 
		public void write(int b) {}
	};

	private Corpus corpus;
	private Unmarshaller unmarshaller;
	private ExecutorService executor;

	public StanfordMultiThreadingTest() throws Exception {
		javax.xml.bind.JAXBContext jaxbCtx = JAXBContext.newInstance(Pair.class,Corpus.class);
		unmarshaller = jaxbCtx.createUnmarshaller();
		executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
	}

	public void readXML(String fileName) throws Exception {
		System.out.println("Reading XML - Started");
		corpus = (Corpus) unmarshaller.unmarshal(new InputStreamReader(new FileInputStream(fileName), StandardCharsets.UTF_8));
		System.out.println("Reading XML - Ended");
	}

	public void parseSentences() throws Exception {
		System.out.println("Parsing - Started");

		// turn pairs into a list of sentences
		List<String> sentences = new ArrayList<String>();
		for (Pair pair : corpus.getPairList()) {
			sentences.add(pair.getT());
			sentences.add(pair.getH());
		}

		// prepare the properties
		final Properties props = new Properties();
		props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");

		// first run is long since models are loaded
		new StanfordCoreNLP(props);

		// to avoid the CoreNLP initialization prints (e.g. "Adding annotation pos")
		final PrintStream nullPrintStream = new PrintStream(new NullStream());
		PrintStream err = System.err;
		System.setErr(nullPrintStream);

		int totalCount = sentences.size();
		AtomicInteger counter = new AtomicInteger(0);

		// use java concurrency to parallelize the parsing
		for (String sentence : sentences) {
			executor.execute(new Runnable() {
				@Override
				public void run() {
					try {
						StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
						Annotation annotation = new Annotation(sentence);
						pipeline.annotate(annotation);
						if (counter.incrementAndGet() % 20 == 0) {
							System.out.println("Done: " + String.format("%.2f", counter.get() * 100 / (double) totalCount));
						}
					} catch (Exception e) {
						System.setErr(err);
						e.printStackTrace();
						System.setErr(nullPrintStream);
						executor.shutdownNow();
					}
				}
			});
		}
		executor.shutdown();
		
		System.out.println("Waiting for parsing to end.");		
		executor.awaitTermination(10, TimeUnit.MINUTES);

		System.out.println("Parsing - Ended");
	}

	public static void main(String[] args) throws Exception {
		StanfordMultiThreadingTest smtt = new StanfordMultiThreadingTest();
		smtt.readXML(args[0]);
		smtt.parseSentences();
	}

}

In my attempt to find some background information I encountered answers given by Christopher Manning and Gabor Angeli from Stanford which indicate that contemporary versions of Stanford CoreNLP should be thread-safe. However, a recent bug report on CoreNLP version 3.4.1 describes a concurrency problem. As mentioned in the title, I'm using version 3.5.2.

It's unclear to me whether the problem I'm facing is due to a bug or due to something wrong in the way I use the package. I'd appreciate it if someone more knowledgeable could shed some light on this. I hope that the sample code would be useful for reproducing the problem. Thanks!


Asked by Assaf on Jun 05 '15.

1 Answer

Have you tried using the threads option? You can specify a number of threads for a single StanfordCoreNLP pipeline and then it will process sentences in parallel.

For example, if you want to process sentences on 8 cores, set the threads option to 8:

Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
props.put("threads", "8");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Nevertheless, I think your solution should also work. We'll check whether there is a concurrency bug, but in the meantime using this option might solve your problem.
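To illustrate the difference between this and your current code: with the threads option you create one pipeline up front and hand it all the sentences, instead of constructing a pipeline inside every task. The sketch below shows that sharing pattern with a hypothetical FakePipeline stand-in (plain java.util.concurrent, no CoreNLP dependency), since the key point is the structure, not the NLP calls:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedPipelineSketch {

    // Hypothetical stand-in for a pipeline that manages its own worker
    // threads internally, analogous to a single StanfordCoreNLP instance
    // created with the "threads" property set.
    static class FakePipeline {
        private final ExecutorService workers;

        FakePipeline(int threads) {
            // expensive setup (model loading) happens once, here
            workers = Executors.newFixedThreadPool(threads);
        }

        // Submit one sentence; the pipeline parallelizes internally.
        void annotate(String sentence, AtomicInteger done) {
            workers.execute(() -> done.incrementAndGet());
        }

        void shutdownAndWait() throws InterruptedException {
            workers.shutdown();
            workers.awaitTermination(1, TimeUnit.MINUTES);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> sentences = Arrays.asList(
                "The cat sat.", "The dog ran.", "Birds fly.");
        AtomicInteger done = new AtomicInteger(0);

        // One shared instance, created once; callers just feed it sentences.
        FakePipeline pipeline = new FakePipeline(8);
        for (String s : sentences) {
            pipeline.annotate(s, done);
        }
        pipeline.shutdownAndWait();
        System.out.println("Annotated " + done.get() + " sentences");
    }
}
```

Besides sidestepping any cross-instance concurrency issue, this also avoids paying the per-pipeline construction cost in every task, which your current per-Runnable `new StanfordCoreNLP(props)` incurs.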

Answered by Sebastian Schuster on Sep 20 '22.