Does someone have an overview with respect to advantages/disadvantages of SparkR vs sparklyr? Google does not yield any satisfactory results and both seem fairly similar. Trying both out, SparkR appears a lot more cumbersome, whereas sparklyr is pretty straight forward (both to install but also to use, especially with the dplyr inputs). Can sparklyr only be used to run dplyr functions in parallel or also "normal" R-Code?
Best
sparklyr provides a range of functions that give you access to Spark's tools for transforming and pre-processing data. SparkR is essentially a way of running R on Spark: you import it into your environment and run your code.
Starting up: the SparkSession is the entry point into SparkR; it connects your R program to a Spark cluster.
sparklyr is an R interface for Apache Spark that lets you install and connect to Spark using YARN, Mesos, Livy or Kubernetes; use dplyr to filter and aggregate Spark datasets and streams and then bring them into R for analysis and visualization; and use MLlib, H2O, XGBoost and GraphFrames to train models at scale.
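For instance, here is a minimal sketch of that sparklyr workflow (the "local" master and the built-in mtcars dataset are just assumptions for illustration):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")          # could also be "yarn", Mesos, Livy, ...

# Copy a small local data frame into Spark and work on it with dplyr verbs
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

mtcars_tbl %>%
  filter(cyl > 4) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()                                    # bring the aggregated result back into R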
You can connect your R program to a Spark cluster from RStudio, the R shell, Rscript or other R IDEs. To start, make sure SPARK_HOME is set in the environment (you can check it with Sys.getenv()), load the SparkR package, and call sparkR.session().
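A minimal sketch of that SparkR start-up sequence, assuming SPARK_HOME already points at a local Spark installation:

Sys.getenv("SPARK_HOME")                       # check where Spark is installed

library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

sparkR.session(master = "local[*]")            # the SparkSession entry point

df <- as.DataFrame(faithful)                   # promote a local data.frame to Spark
head(df)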
The biggest advantage of SparkR is the ability to run arbitrary user-defined functions written in R on Spark:
https://spark.apache.org/docs/2.0.1/sparkr.html#applying-user-defined-function
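As a rough sketch of what that looks like (the column names and derived schema here are just an illustration), dapply() runs an arbitrary R function on each partition of a Spark DataFrame:

library(SparkR)
sparkR.session(master = "local[*]")            # reuses an existing session if one is running

df <- as.DataFrame(mtcars)

# The output schema of the R function has to be declared up front
schema <- structType(structField("mpg", "double"),
                     structField("mpg_per_cyl", "double"))

result <- dapply(df,
                 function(part) {
                   # plain R code, applied partition by partition
                   data.frame(mpg = part$mpg,
                              mpg_per_cyl = part$mpg / part$cyl)
                 },
                 schema)

head(collect(result))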
Since sparklyr translates R to SQL, you can only use a very small set of functions in mutate statements:
http://spark.rstudio.com/dplyr.html#sql_translation
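You can see the translation with show_query(); this sketch assumes the sc connection and mtcars_tbl table from the sketch above:

mtcars_tbl %>%
  mutate(kpl = mpg * 0.425) %>%   # simple arithmetic has a direct SQL translation
  show_query()                    # prints the Spark SQL that sparklyr generates

# A mutate() that calls an R-only function with no SQL translation will fail
# on a Spark table, which is exactly the limitation above.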
That deficiency is somewhat alleviated by Extensions (http://spark.rstudio.com/extensions.html#wrapper_functions).
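For example, invoke() from the Extensions guide lets you call methods on the underlying Java objects directly; this sketch again assumes the sc connection and mtcars_tbl table from above:

mtcars_sdf <- spark_dataframe(mtcars_tbl)                  # the Java-side DataFrame behind the tbl

invoke(mtcars_sdf, "count")                                # row count via the JVM
cat(invoke(invoke(mtcars_sdf, "schema"), "treeString"))    # Spark's own schema printout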
Other than that, sparklyr is the winner (in my opinion). Aside from the obvious advantage of using familiar dplyr functions, sparklyr has a much more comprehensive API for MLlib (http://spark.rstudio.com/mllib.html), plus the Extensions mentioned above.
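A short sketch of that MLlib interface, again assuming the sc connection and mtcars_tbl table from the earlier sketch:

# Fit a linear regression in Spark via MLlib and inspect the fitted model
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)

# Predictions come back as a Spark table that can be collected into R
predictions <- ml_predict(fit, mtcars_tbl)
head(collect(predictions))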
Being a wrapper, sparklyr has some limitations. For example, using copy_to() to create a Spark data frame does not preserve columns formatted as dates, whereas SparkR's as.DataFrame() preserves dates.
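A sketch of how you could check that yourself (behaviour may differ across sparklyr/SparkR versions; sc is the sparklyr connection from the earlier sketch):

dates_df <- data.frame(id = 1:3,
                       day = as.Date(c("2017-01-01", "2017-01-02", "2017-01-03")))

# sparklyr: inspect the column types Spark ends up with after copy_to()
dates_tbl <- copy_to(sc, dates_df, "dates_spark", overwrite = TRUE)
sdf_schema(dates_tbl)

# SparkR: check whether as.DataFrame() keeps the column as a date type
library(SparkR)
sparkR.session(master = "local[*]")
printSchema(as.DataFrame(dates_df))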