
parallel parLapply setup

Tags:

r

parallel-processing

I am trying to run part-of-speech tagging from the openNLP/NLP packages in parallel. The code needs to work on any OS, so I am opting for the parLapply function from parallel (but I am open to other OS-independent options). In the past I ran the tagPOS function from the openNLP package in parLapply with no problem. However, recent changes to the openNLP package eliminated tagPOS and added some more flexible options. Kurt was kind enough to help me recreate the tagPOS function from the new package's tools. I can get the lapply version to work but not the parallel version. It keeps saying the nodes need more variables passed to them, until it finally asks for a non-exported function from openNLP. It seems odd that it keeps asking for more and more variables, which tells me I'm setting up parLapply incorrectly. How can I set up tagPOS to operate in a parallel, OS-independent fashion?

library(openNLP)
library(NLP)
library(parallel)

## POS tagger
tagPOS <-  function(x, pos_tag_annotator, ...) {
    s <- as.String(x)
    ## Need sentence and word token annotations.
    word_token_annotator <- Maxent_Word_Token_Annotator()
    a2 <- Annotation(1L, "sentence", 1L, nchar(s))
    a2 <- annotate(s, word_token_annotator, a2)
    a3 <- annotate(s, pos_tag_annotator, a2)

    ## Determine the distribution of POS tags for word tokens.
    a3w <- a3[a3$type == "word"]
    POStags <- unlist(lapply(a3w$features, `[[`, "POS"))

    ## Extract token/POS pairs (all of them): easy.
    POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
    list(POStagged = POStagged, POStags = POStags)
} ## End of tagPOS function 

## Set up a parallel run
text.var <- c("I like it.", "This is outstanding soup!",  
    "I really must get the recipe.")
ntv <- length(text.var)
PTA <- Maxent_POS_Tag_Annotator()   

cl <- makeCluster(mc <- getOption("cl.cores", detectCores()/2))
clusterExport(cl=cl, varlist=c("text.var", "ntv", 
    "tagPOS", "PTA", "as.String", "Maxent_Word_Token_Annotator"), 
    envir = environment())
m <- parLapply(cl, seq_len(ntv), function(i) {
        x <- tagPOS(text.var[i], PTA)
        return(x)
    }
)
stopCluster(cl)

## Error in checkForRemoteErrors(val) : 
##   3 nodes produced errors; first error: could not find function 
##   "Maxent_Simple_Word_Tokenizer"

openNLP::Maxent_Simple_Word_Tokenizer

## >openNLP::Maxent_Simple_Word_Tokenizer
## Error: 'Maxent_Simple_Word_Tokenizer' is not an exported 
##     object from 'namespace:openNLP'

## It's a non exported function
openNLP:::Maxent_Simple_Word_Tokenizer


## Demo that it works with lapply
lapply(seq_len(ntv), function(i) {
    tagPOS(text.var[i], PTA)
})

lapply(text.var, function(x) {
    tagPOS(x, PTA)
})

## >     lapply(seq_len(ntv), function(i) {
## +         tagPOS(text.var[i], PTA)
## +     })
## [[1]]
## [[1]]$POStagged
## [1] "I/PRP like/IN it/PRP ./."
## 
## [[1]]$POStags
## [1] "PRP" "IN"  "PRP" "."  
## 
## [[1]]$word.count
## [1] 3
## 
## 
## [[2]]
## [[2]]$POStagged
## [1] "This/DT is/VBZ outstanding/JJ soup/NN !/."
## 
## [[2]]$POStags
## [1] "DT"  "VBZ" "JJ"  "NN"  "."  
## 
## [[2]]$word.count
## [1] 4
## 
## 
## [[3]]
## [[3]]$POStagged
## [1] "I/PRP really/RB must/MD get/VB the/DT recipe/NN ./."
## 
## [[3]]$POStags
## [1] "PRP" "RB"  "MD"  "VB"  "DT"  "NN"  "."  
## 
## [[3]]$word.count
## [1] 6

EDIT: per Steve's suggestion

Note that this release of openNLP is brand new. I installed version 0.2-1 from a tar.gz from CRAN. I get the following error even though this function exists.

library(openNLP); library(NLP); library(parallel)

tagPOS <-  function(text.var, pos_tag_annotator, ...) {
    s <- as.String(text.var)

    ## Set up the POS annotator if missing (for parallel)
    if (missing(pos_tag_annotator)) {
        PTA <- Maxent_POS_Tag_Annotator()
    } else {
        PTA <- pos_tag_annotator
    }

    ## Need sentence and word token annotations.
    word_token_annotator <- Maxent_Word_Token_Annotator()
    a2 <- Annotation(1L, "sentence", 1L, nchar(s))
    a2 <- annotate(s, word_token_annotator, a2)
    a3 <- annotate(s, PTA, a2)

    ## Determine the distribution of POS tags for word tokens.
    a3w <- a3[a3$type == "word"]
    POStags <- unlist(lapply(a3w$features, "[[", "POS"))

    ## Extract token/POS pairs (all of them): easy.
    POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
    list(POStagged = POStagged, POStags = POStags)
}

text.var <- c("I like it.", "This is outstanding soup!",  
    "I really must get the recipe.")

cl <- makeCluster(mc <- getOption("cl.cores", detectCores()/2))
clusterEvalQ(cl, {library(openNLP); library(NLP)})
m <- parLapply(cl, text.var, tagPOS)

## > m <- parLapply(cl, text.var, tagPOS)
## Error in checkForRemoteErrors(val) : 
##   3 nodes produced errors; first error: could not find function "Maxent_POS_Tag_Annotator"

stopCluster(cl)


> packageDescription('openNLP')
Package: openNLP
Encoding: UTF-8
Version: 0.2-1
Title: Apache OpenNLP Tools Interface
Authors@R: person("Kurt", "Hornik", role = c("aut", "cre"), email =
          "Kurt.Hornik@R-project.org")
Description: An interface to the Apache OpenNLP tools (version 1.5.3).  The Apache OpenNLP
          library is a machine learning based toolkit for the processing of natural language
          text written in Java.  It supports the most common NLP tasks, such as tokenization,
          sentence segmentation, part-of-speech tagging, named entity extraction, chunking,
          parsing, and coreference resolution.  See http://opennlp.apache.org/ for more
          information.
Imports: NLP (>= 0.1-0), openNLPdata (>= 1.5.3-1), rJava (>= 0.6-3)
SystemRequirements: Java (>= 5.0)
License: GPL-3
Packaged: 2013-08-20 13:23:54 UTC; hornik
Author: Kurt Hornik [aut, cre]
Maintainer: Kurt Hornik <Kurt.Hornik@R-project.org>
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2013-08-20 15:41:22
Built: R 3.0.1; ; 2013-08-20 13:48:47 UTC; windows


asked Aug 21 '13 12:08

Tyler Rinker

1 Answer

Since you're calling functions from NLP on the cluster workers, you should load it on each of the workers before calling parLapply. You can do that from the worker function, but I tend to use clusterCall or clusterEvalQ right after creating the cluster object:

clusterEvalQ(cl, {library(openNLP); library(NLP)})

Since as.String and Maxent_Word_Token_Annotator are in those packages, they shouldn't be exported with clusterExport; loading the packages on the workers makes them available there.
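As a minimal sketch of that pattern without needing openNLP or Java, the following uses the base tools package as a stand-in (an assumption made only for illustration). tools is not attached on a fresh worker, so file_ext is only found because clusterEvalQ attaches the package there first:

```r
library(parallel)

cl <- makeCluster(2)

# Attach the package on every worker instead of exporting its functions.
clusterEvalQ(cl, library(tools))

# file_ext() is resolved on the workers through the attached package.
exts <- parLapply(cl, c("notes.txt", "script.R"), function(f) file_ext(f))

stopCluster(cl)
print(unlist(exts))
```

The same shape should apply with library(openNLP) and library(NLP) in place of library(tools).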

Note that while running your example on my machine, I noticed that the PTA object doesn't work after being exported to the worker machines. Presumably there is something in that object that can't be safely serialized and unserialized (my guess is a reference to a Java object held through rJava, but I haven't verified that). After I created that object on the workers using clusterEvalQ, the example ran successfully. Here it is, using openNLP 0.2-1:

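library(parallel)

## Runs on the workers. It relies on a PTA object that exists in each
## worker's global environment (created below with clusterEvalQ).
tagPOS <-  function(x, ...) {
    s <- as.String(x)
    ## Need sentence and word token annotations.
    word_token_annotator <- Maxent_Word_Token_Annotator()
    a2 <- Annotation(1L, "sentence", 1L, nchar(s))
    a2 <- annotate(s, word_token_annotator, a2)
    a3 <- annotate(s, PTA, a2)
    ## Keep only the word tokens and pull out their POS tags.
    a3w <- a3[a3$type == "word"]
    POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
    ## Build the token/POS pairs.
    POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
    list(POStagged = POStagged, POStags = POStags)
}
text.var <- c("I like it.", "This is outstanding soup!",
    "I really must get the recipe.")
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()/2))
## Load the packages and create the annotator on every worker,
## rather than exporting them from the master.
clusterEvalQ(cl, {
    library(openNLP)
    library(NLP)
    PTA <- Maxent_POS_Tag_Annotator()
})
m <- parLapply(cl, text.var, tagPOS)
print(m)
stopCluster(cl)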
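The "create it on the workers" idea can be tried out without openNLP. The sketch below is only an illustration under stated assumptions: toy_tags and toy_tagger are hypothetical stand-ins for the real annotator, not part of any package. The object is built inside clusterEvalQ, and the block ends with NULL so nothing is sent back to the master:

```r
library(parallel)

cl <- makeCluster(2)

# Build a per-worker object in each worker's global environment.
# toy_tags and toy_tagger are hypothetical stand-ins for PTA.
clusterEvalQ(cl, {
    toy_tags <- c(I = "PRP", like = "VBP", it = "PRP")
    toy_tagger <- function(words) unname(toy_tags[words])
    NULL  # avoid serializing the worker-side object back to the master
})

# The worker function looks up toy_tagger on the worker at call time.
tag_sentence <- function(x) {
    words <- strsplit(x, " ", fixed = TRUE)[[1]]
    toy_tagger(words)
}

res <- parLapply(cl, c("I like it", "it like I"), tag_sentence)
stopCluster(cl)
print(res)
```

Because tag_sentence refers to toy_tagger by name, the master never needs a working copy of the object; only the function and the input text cross the process boundary.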

If clusterEvalQ fails because Maxent_POS_Tag_Annotator is not found, you might be loading the wrong version of openNLP on the workers. You can determine what package versions you're getting on the workers by executing sessionInfo with clusterEvalQ:

library(parallel)
cl <- makeCluster(2)
clusterEvalQ(cl, {library(openNLP); library(NLP)})
clusterEvalQ(cl, sessionInfo())

This will return the results of executing sessionInfo() on each of the cluster workers. Here is the version information for some of the packages that I'm using and that work for me:

other attached packages:
[1] NLP_0.1-0     openNLP_0.2-1

loaded via a namespace (and not attached):
[1] openNLPdata_1.5.3-1 rJava_0.9-4
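If the full sessionInfo() output is more than you need, a narrower check may be easier to read. This is a sketch that compares one package version and the library paths between the master and the workers; it uses the base stats package as a placeholder (an assumption for illustration), which you would replace with "openNLP":

```r
library(parallel)

cl <- makeCluster(2)

# Version of the package as each worker sees it (placeholder: "stats").
worker_versions <- clusterEvalQ(cl, as.character(packageVersion("stats")))

# Library search paths on each worker; a mismatch with the master can
# explain why a worker picks up a different build of a package.
worker_libs <- clusterEvalQ(cl, .libPaths())

stopCluster(cl)

master_version <- as.character(packageVersion("stats"))
all_match <- all(unlist(worker_versions) == master_version)
print(all_match)
print(worker_libs[[1]])
```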

answered Oct 08 '22 05:10

Steve Weston
