
PySpark distributed processing on a YARN cluster

I have Spark running on a Cloudera CDH5.3 cluster, using YARN as the resource manager. I am developing Spark apps in Python (PySpark).

I can submit jobs and they run successfully, but they never seem to run on more than one machine (the local machine I submit from).

I have tried a variety of options, such as setting --deploy-mode to cluster and --master to yarn-client or yarn-cluster, yet the job never seems to run on more than one server.

I can get it to run on more than one core by passing something like --master local[8], but that obviously doesn't distribute the processing over multiple nodes.

I have a very simple Python script processing data from HDFS like so:

import simplejson as json
from pyspark import SparkContext

# Leave the master unset here so spark-submit can supply it
sc = SparkContext(appName="Joe Counter")

rdd = sc.textFile("hdfs:///tmp/twitter/json/data/")

data = rdd.map(lambda line: json.loads(line))

joes = data.filter(lambda tweet: "Joe" in tweet.get("text", ""))

print(joes.count())
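As a sanity check, the per-record logic in the script can be exercised locally without a cluster. The sample tweet line below is made up for illustration, and the standard-library json module stands in for simplejson:

```python
import json  # stdlib json parses the same way simplejson does here

line = '{"text": "Joe went to the market", "user": "someone"}'
tweet = json.loads(line)

# Same predicate the filter() applies to each record
matches = "Joe" in tweet.get("text", "")
print(matches)  # True
```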

And I am running a submit command like:

spark-submit atest.py --deploy-mode client --master yarn-client

What can I do to ensure the job runs in parallel across the cluster?

asked Jan 30 '15 by aaa90210

People also ask

Is PySpark distributed?

PySpark is the Python API for Apache Spark, an open source, distributed computing framework and set of libraries for real-time, large-scale data processing. If you're already familiar with Python and libraries such as Pandas, then PySpark is a good language to learn to create more scalable analyses and pipelines.

Do you need to install Spark on all nodes of the yarn cluster?

No, it is not necessary to install Spark on all three nodes. Since Spark runs on top of YARN, it uses YARN to execute its commands across the cluster's nodes, so you only have to install Spark on one node.


1 Answer

Can you swap the arguments for the command? spark-submit --deploy-mode client --master yarn-client atest.py

If you look at the help text for the command:

spark-submit

Usage: spark-submit [options] <app jar | python file>
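The ordering matters because spark-submit stops parsing its own options at the first positional token (the application file); everything after it is handed to the script as application arguments, so the YARN flags in the original command never reached spark-submit. A rough sketch of that split (a simplified illustration, not Spark's actual parser, and the option set here is a made-up subset):

```python
def split_submit_args(argv):
    """Roughly mimic how spark-submit separates its own options
    from application arguments: option tokens before the app file
    belong to spark-submit; the app file and everything after it
    go to the application."""
    spark_opts, app_args = [], []
    # simplified, assumed subset of options that take a value
    valued = {"--master", "--deploy-mode", "--num-executors"}
    i = 0
    while i < len(argv):
        tok = argv[i]
        if tok.startswith("--"):
            spark_opts.append(tok)
            if tok in valued and i + 1 < len(argv):
                spark_opts.append(argv[i + 1])
                i += 1
        else:
            # first positional token is the app file;
            # it and the rest become application arguments
            app_args = argv[i:]
            break
        i += 1
    return spark_opts, app_args

# The asker's ordering: the YARN flags never reach spark-submit
opts, app = split_submit_args(
    ["atest.py", "--deploy-mode", "client", "--master", "yarn-client"])
print(opts)  # []
print(app)   # ['atest.py', '--deploy-mode', 'client', '--master', 'yarn-client']
```

With the options moved before atest.py, the same sketch assigns them to spark-submit and leaves only the script as the positional argument.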
answered Sep 18 '22 by MrChristine