SparkR filterRDD and flatMap not working

After spending a long time working out how to install SparkR I think there might be some issues with the package...

Please bear in mind I am very new to Spark, so I am not sure if I have done the right thing or not.

From a fresh 64-bit Ubuntu EC2 instance I've installed R and the JDK.

I cloned the Apache Spark repo and built it with:

git clone https://github.com/apache/spark.git
cd spark
build/mvn -DskipTests -Psparkr package

I then changed my .Rprofile to reference the SparkR library directory by including the following lines:

Sys.setenv(SPARK_HOME="/home/ubuntu/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

Then, after starting R, I try to run through the quick start guide given here.

Below are the steps I took:

 R> library(SparkR)
 R> sc <- sparkR.init(master="local")
 R> textFile <- SparkR:::textFile(sc, "/home/ubuntu/spark/README.md")
 R> cc <- SparkR:::count(textFile)
 R> t10 <- SparkR:::take(textFile,10)

Everything works fine up to here... but the lines below do not work:

 R> SparkR:::filterRDD(textFile, function(line){ grepl("Spark", line)})
 Error: class(objId) == "jobj" is not TRUE

 R> traceback()
 7: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"), 
   ch), call. = FALSE, domain = NA)
 6: stopifnot(class(objId) == "jobj")
 5: callJMethod(object@jrdd, "toString")
 4: paste(callJMethod(object@jrdd, "toString"), "\n", sep = "")
 3: cat(paste(callJMethod(object@jrdd, "toString"), "\n", sep = ""))
 2: function (object) 
    standardGeneric("show")(x)
 1: function (object) 
    standardGeneric("show")(x)

Another example that doesn't work is below:

 R> SparkR:::flatMap(textFile,
          function(line) {
            strsplit(line, " ")[[1]]
          })
  Error: class(objId) == "jobj" is not TRUE

Below is my session info...

 R> sessionInfo()
 R version 3.2.0 (2015-04-16)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Ubuntu 14.04.2 LTS

 locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
 [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base     

 other attached packages:
 [1] SparkR_1.4.0

Any help here would be greatly appreciated....

asked May 05 '15 by h.l.m

1 Answer

So this is actually a bug in the show method of the RDD in SparkR, and I've documented it at https://issues.apache.org/jira/browse/SPARK-7512

However, this bug should not affect your computation in any way; the error only comes from R auto-printing the returned RDD at the console, which calls the broken show method. So if you instead assign the result, as in

filteredRDD <- SparkR:::filterRDD(textFile, function(line){ grepl("Spark", line)})

then the error message should go away.
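For example, you can confirm that the filtered RDD still computes correctly by reusing the internal count and take helpers from the question (a quick sketch; these SparkR:::-internal functions may change between versions):

 R> SparkR:::count(filteredRDD)     # number of lines containing "Spark"
 R> SparkR:::take(filteredRDD, 5)   # inspect the first few matching lines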

answered Oct 20 '22 by Shivaram Venkataraman