After spending a long time working out how to install SparkR, I think there may be some issues with the package.
Please bear in mind that I am very new to Spark, so I am not sure whether I have done the right thing or not.
From a fresh EC2 Ubuntu 64-bit instance I installed R and the JDK.
I then cloned the Apache Spark repo and built it with:
git clone https://github.com/apache/spark.git
cd spark
build/mvn -DskipTests -Psparkr package
I then changed my .Rprofile to reference the SparkR library directory by adding the following lines:
Sys.setenv(SPARK_HOME="/home/ubuntu/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
Then, after starting R, I tried to run through the quick start guide given here.
Below are the steps I took:
R> library(SparkR)
R> sc <- sparkR.init(master="local")
R> textFile <- SparkR:::textFile(sc, "/home/ubuntu/spark/README.md")
R> cc <- SparkR:::count(textFile)
R> t10 <- SparkR:::take(textFile,10)
Everything works fine up to this point; the lines below do not work:
R> SparkR:::filterRDD(textFile, function(line){ grepl("Spark", line)})
Error: class(objId) == "jobj" is not TRUE
R> traceback()
7: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"),
ch), call. = FALSE, domain = NA)
6: stopifnot(class(objId) == "jobj")
5: callJMethod(object@jrdd, "toString")
4: paste(callJMethod(object@jrdd, "toString"), "\n", sep = "")
3: cat(paste(callJMethod(object@jrdd, "toString"), "\n", sep = ""))
2: function (object)
standardGeneric("show")(x)
1: function (object)
standardGeneric("show")(x)
Another example that does not work is below.
R> SparkR:::flatMap(textFile,
function(line) {
strsplit(line, " ")[[1]]
})
Error: class(objId) == "jobj" is not TRUE
Below is my session info:
R> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] SparkR_1.4.0
Any help here would be greatly appreciated.
The map() transformation takes a function and applies it to each element of the RDD; the result of the function becomes the new value of that element in the resulting RDD, so the output has exactly one element per input element. flatMap() is similar, but the supplied function may return zero, one, or more elements per input element, and the results are flattened into a single RDD, so the element count of the output can differ from that of the input.
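To make the difference concrete, here is a minimal sketch against the same private RDD API (SparkR:::parallelize and SparkR:::map are assumptions on my part, present in the 1.4 codebase alongside the flatMap and count calls used above); results are assigned to variables to avoid the auto-printing issue discussed below:
R> lines <- SparkR:::parallelize(sc, list("a b", "c d e"))
R> mapped <- SparkR:::map(lines, function(line) { strsplit(line, " ")[[1]] })
R> SparkR:::count(mapped)  # 2 -- one output element (a character vector) per input line
R> words <- SparkR:::flatMap(lines, function(line) { strsplit(line, " ")[[1]] })
R> SparkR:::count(words)   # 5 -- "a", "b", "c", "d", "e"; the vectors are flattened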
This is actually a bug in the show method of the RDD class in SparkR, and I have documented it at https://issues.apache.org/jira/browse/SPARK-7512
However, this bug should not affect your computation in any way. So if you instead use
filteredRDD <- SparkR:::filterRDD(textFile, function(line){ grepl("Spark", line)})
then the error message should go away.
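As a quick sanity check that the underlying computation is unaffected, you can run an action on the assigned RDD using only the calls from the question; only auto-printing the RDD itself hits the show bug:
R> filteredRDD <- SparkR:::filterRDD(textFile, function(line){ grepl("Spark", line)})
R> SparkR:::count(filteredRDD)  # returns the number of lines containing "Spark" without error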