Can sparklyr be used with Spark deployed on a YARN-managed Hadoop cluster?

Is the sparklyr R package able to connect to YARN-managed Hadoop clusters? This doesn't seem to be documented in the cluster deployment documentation. Using the SparkR package that ships with Spark, it is possible by doing:

# set R environment variables
Sys.setenv(YARN_CONF_DIR=...)
Sys.setenv(SPARK_CONF_DIR=...)
Sys.setenv(LD_LIBRARY_PATH=...)
Sys.setenv(SPARKR_SUBMIT_ARGS=...)

sparkr_lib_dir <- ... # install specific
library(SparkR, lib.loc = c(sparkr_lib_dir, .libPaths()))
sc <- sparkR.init(master = "yarn-client")

However, when I swapped the last two lines above with

library(sparklyr)
sc <- spark_connect(master = "yarn-client")

I get the following error:

Error in start_shell(scon, list(), jars, packages) : 
  Failed to launch Spark shell. Ports file does not exist.
    Path: /usr/hdp/2.4.2.0-258/spark/bin/spark-submit
    Parameters: '--packages' 'com.databricks:spark-csv_2.11:1.3.0,com.amazonaws:aws-java-sdk-pom:1.10.34' '--jars' '<path to R lib>/3.2/sparklyr/java/rspark_utils.jar'  sparkr-shell /tmp/RtmpT31OQT/filecfb07d7f8bfd.out

Ivy Default Cache set to: /home/mpollock/.ivy2/cache
The jars for the packages stored in: /home/mpollock/.ivy2/jars
:: loading settings :: url = jar:file:<path to spark install>/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
com.amazonaws#aws-java-sdk-pom added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
:: resolution report :: resolve 480ms :: artifacts dl 0ms
    :: modules in use:
    -----------------------------------------

Is sparklyr an alternative to SparkR or is it built on top of the SparkR package?

asked Jun 29 '16 by Matt Pollock

People also ask

Can you use Spark with YARN?

There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application.
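
In sparklyr terms, the deploy mode is selected through the master string passed to spark_connect. A minimal sketch (cluster mode assumes a sparklyr release new enough to support it):

library(sparklyr)

# Client mode: the driver runs inside your R session
sc <- spark_connect(master = "yarn-client")

# Cluster mode: the driver runs inside a YARN application master
# (assumes a sparklyr version with yarn-cluster support)
# sc <- spark_connect(master = "yarn-cluster")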

Do you need to install Spark on all nodes of the YARN cluster?

No, it is not necessary to install Spark on every node. Since Spark runs on top of YARN, it uses YARN to execute its commands across the cluster's nodes, so you only have to install Spark on one node.

Why YARN is used in Spark?

Spark's YARN support allows scheduling Spark workloads on Hadoop alongside a variety of other data-processing frameworks.

How do I run Spark application in cluster mode?

Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.


2 Answers

Yes, sparklyr can be used against a YARN-managed cluster. In order to connect to YARN-managed clusters one needs to:

  1. Set the SPARK_HOME environment variable to point to the right Spark home directory.
  2. Connect to the Spark cluster using the appropriate master location, for instance: sc <- spark_connect(master = "yarn-client") (see the sketch below)
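
Putting the two steps together, a minimal sketch (the SPARK_HOME path is an assumption; point it at your own Spark installation):

library(sparklyr)

# Assumed location of the cluster's Spark installation -- adjust for your environment
Sys.setenv(SPARK_HOME = "/usr/hdp/current/spark-client")

sc <- spark_connect(master = "yarn-client")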

See also: http://spark.rstudio.com/deployment.html

answered Sep 28 '22 by Javier Luraschi


Yes it can, but there is one catch that everything else written on the subject tends to gloss over: configuring the resources.

The key is this: when you run in local mode you do not have to configure the resources declaratively, but when you execute against the YARN cluster you absolutely do have to declare them. It took me a long time to find the article that shed some light on this issue, but once I tried it, it worked.

Here's an (arbitrary) example with the key reference:

library(sparklyr)

# Point sparklyr at the Spark and Hadoop installations (adjust the paths for your cluster)
Sys.setenv(SPARK_HOME = "/usr/local/spark")
Sys.setenv(HADOOP_CONF_DIR = '/usr/local/hadoop/etc/hadoop/conf')
Sys.setenv(YARN_CONF_DIR = '/usr/local/hadoop/etc/hadoop/conf')

# Declare the resources YARN should allocate to the application
config <- spark_config()
config$spark.executor.instances <- 4
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"

sc <- spark_connect(master = "yarn-client", config = config, version = '2.1.0')

R Bloggers Link to Article
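
Once sc is connected, a quick smoke test confirms the session really is talking to the cluster (a sketch using the built-in iris data set; assumes dplyr is installed):

library(dplyr)

# Copy a small local data frame to Spark and run a trivial aggregation
iris_tbl <- copy_to(sc, iris, "iris_spark")
iris_tbl %>% count(Species)

spark_disconnect(sc)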

answered Sep 28 '22 by ProfVersaggi