
SparkR vs sparklyr [closed]

Does anyone have an overview of the advantages and disadvantages of SparkR vs sparklyr? Google does not yield any satisfactory results, and the two seem fairly similar. Having tried both, I find SparkR a lot more cumbersome, whereas sparklyr is pretty straightforward (both to install and to use, especially with the dplyr verbs). Can sparklyr only be used to run dplyr functions in parallel, or can it also run "normal" R code?

Best

koVex asked Sep 14 '16 15:09



2 Answers

The biggest advantage of SparkR is that it can run arbitrary user-defined functions written in R on Spark:

https://spark.apache.org/docs/2.0.1/sparkr.html#applying-user-defined-function
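As a rough illustration of what that looks like, here is a minimal sketch using `dapply()` from SparkR, modeled on the example in the linked docs. The helper name `add_waiting_secs` is made up for this example, and the Spark part only runs if SparkR is installed and `SPARK_HOME` is set (both are assumptions about your setup).

```r
# Plain R function: receives each partition as an ordinary R data.frame.
# (add_waiting_secs is a hypothetical name chosen for this sketch.)
add_waiting_secs <- function(pdf) {
  pdf$waiting_secs <- pdf$waiting * 60
  pdf
}

# Only try Spark if SparkR is available and SPARK_HOME is configured.
if (requireNamespace("SparkR", quietly = TRUE) && nzchar(Sys.getenv("SPARK_HOME"))) {
  library(SparkR)
  sparkR.session()

  df <- createDataFrame(faithful)

  # dapply() needs the schema of the data.frame the function returns.
  schema <- structType(
    structField("eruptions", "double"),
    structField("waiting", "double"),
    structField("waiting_secs", "double")
  )

  result <- dapply(df, add_waiting_secs, schema)
  print(head(collect(result)))

  sparkR.session.stop()
}
```

The function body is ordinary R, so anything that works on a local data.frame (and is installed on the workers) can go in there.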

Since sparklyr translates R to SQL, you can only use a very small set of functions in mutate statements:

http://spark.rstudio.com/dplyr.html#sql_translation
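To see the translation in action, here is a small sketch, assuming sparklyr and dplyr are installed and a local Spark was set up via `spark_install()`. The table name `mtcars_spark` and the helper `mpg_to_kpl` are made up for illustration.

```r
# An ordinary R function; fine on a local data.frame.
# (mpg_to_kpl is a hypothetical helper for this sketch.)
mpg_to_kpl <- function(mpg) mpg * 0.425144

# Guarded so the sketch is skipped when no local Spark install is present.
if (requireNamespace("sparklyr", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE) &&
    nrow(sparklyr::spark_installed_versions()) > 0) {
  library(sparklyr)
  library(dplyr)

  sc <- spark_connect(master = "local")
  cars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

  # Simple arithmetic is translated to Spark SQL; show_query() prints the SQL.
  cars_tbl %>%
    mutate(kpl = mpg * 0.425144) %>%
    show_query()

  # By contrast, mutate(kpl = mpg_to_kpl(mpg)) is not translated: the call is
  # passed through to Spark SQL as-is, and Spark does not know that function.

  spark_disconnect(sc)
}
```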

That deficiency is somewhat alleviated by Extensions (http://spark.rstudio.com/extensions.html#wrapper_functions).

Other than that, sparklyr is the winner (in my opinion). Aside from the obvious advantage of using familiar dplyr functions, sparklyr has a much more comprehensive API for MLlib (http://spark.rstudio.com/mllib.html) and the Extensions mentioned above.

Alex Vorobiev answered Sep 19 '22 15:09


Because sparklyr is a wrapper, it has some limitations. For example, using copy_to() to create a Spark DataFrame does not preserve columns formatted as dates. With SparkR, as.DataFrame() preserves dates.
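If you want to check this on your own versions, something like the sketch below prints the column types each package ends up with. It assumes a local Spark install for sparklyr and `SPARK_HOME` set for SparkR; the table name `days_check` is made up. Behavior may differ between releases, so treat the output as the thing to look at rather than taking either result for granted.

```r
# Local data with a proper Date column.
local_df <- data.frame(
  id  = 1:2,
  day = as.Date(c("2016-09-14", "2016-09-19"))
)

# sparklyr: copy the data over and inspect the resulting Spark schema.
if (requireNamespace("sparklyr", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE) &&
    nrow(sparklyr::spark_installed_versions()) > 0) {
  sc <- sparklyr::spark_connect(master = "local")
  days_tbl <- dplyr::copy_to(sc, local_df, "days_check", overwrite = TRUE)
  str(sparklyr::sdf_schema(days_tbl))  # column names and Spark types
  sparklyr::spark_disconnect(sc)
}

# SparkR: same data through as.DataFrame(), then print the schema.
if (requireNamespace("SparkR", quietly = TRUE) && nzchar(Sys.getenv("SPARK_HOME"))) {
  SparkR::sparkR.session()
  days_sdf <- SparkR::as.DataFrame(local_df)
  SparkR::printSchema(days_sdf)
  SparkR::sparkR.session.stop()
}
```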

Reuben L. answered Sep 16 '22 15:09