Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Stanford Parser multithread usage

Stanford Parser is now 'thread-safe' as of version 2.0 (02.03.2012). I am currently running the command line tools and cannot figure out how to make use of my multiple cores by threading the program.

In the past, this question has been answered with "Stanford Parser is not thread-safe", as the FAQ still says. I am hoping to find someone who has had success threading the latest version.

I have tried using -t flag (-t10 and -tLLP) since that was all I could find in my searches, but both throw errors.

An example of a command I issue is:

java -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser \
-outputFormat "oneline" ./grammar/englishPCFG.ser.gz ./corpus > corpus.lex
like image 763
earthmeLon Avatar asked Feb 15 '12 01:02

earthmeLon


1 Answers

Starting with version 2.0.5, you can now easily use multiple threads with the option -nthreads k. For example, your command can be like this:

java -mx6g edu.stanford.nlp.parser.lexparser.LexicalizedParser -nthreads 4 edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz file.txt > file.stp

(Releases of version 2 prior to 2013 had no way to enable multithreading from the command-line, but only when using the API.)

Internally, you can simultaneously run as many parsing threads inside one JVM process as you want. You can do this either by getting and using multiple LexicalizedParserQuery objects (via the parserQuery() method) or implicitly by calling apply(...) or parseTree(...) off one LexicalizedParser. The -nthreads k option does this for you by sending successive sentences to different parsers using the Executor framework. You can also simultaneously create multiple LexicalizedParser's, e.g., for parsing different languages.

Multiple LexicalizedparserQuery objects share the same grammar (LexicalizedParser), but the memory space savings aren't huge, as most of the memory goes to the transient structures used in chart parsing. So, if you are running lots of parsing threads concurrently, you will need to give a lot of memory to the JVM, as in the example above.

p.s. Sorry, yes, some of the documentation still needs updating. But -tLPP is one flag for specifying language-specific resources. The Stanford Parser has no -t flag.

like image 71
Christopher Manning Avatar answered Nov 09 '22 08:11

Christopher Manning