PySpark distributed processing on a YARN cluster

Tags: apache-spark, pyspark, hadoop-yarn, cloudera-cdh

I have Spark running on a Cloudera CDH5.3 cluster, using YARN as the resource manager. I am developing Spark apps in Python (PySpark).

I can submit jobs and they run successfully; however, they never seem to run on more than one machine (the local machine I submit from).

I have tried a variety of options, like setting --deploy-mode to cluster and --master to yarn-client and yarn-cluster, yet it never seems to run on more than one server.

I can get it to run on more than one core by passing something like --master local[8], but that obviously doesn't distribute the processing over multiple nodes.

I have a very simple Python script that processes data from HDFS like so:

import simplejson as json
from pyspark import SparkContext
sc = SparkContext("", "Joe Counter")

rrd = sc.textFile("hdfs:///tmp/twitter/json/data/")

data = rrd.map(lambda line: json.loads(line))

joes = data.filter(lambda tweet: "Joe" in tweet.get("text",""))

print joes.count()

And I am running a submit command like:

spark-submit atest.py --deploy-mode client --master yarn-client

What can I do to ensure the job runs in parallel across the cluster?

asked Jan 30 '15 05:01

aaa90210

1 Answer

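Try putting the options before the script name:

spark-submit --deploy-mode client --master yarn-client atest.py

spark-submit reads its own options only up to the application file; anything after atest.py is passed to the script as application arguments. In your command, --deploy-mode and --master were therefore never seen by spark-submit, so the job did not run against YARN.

You can see the expected order in the help text for the command: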

spark-submit

Usage: spark-submit [options] <app jar | python file>
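To make the ordering rule concrete, here is a rough sketch in plain Python of how a command line following that usage pattern gets split. This is not Spark's actual parser: split_submit_args is a hypothetical helper, and it assumes every option has the form "--name value", which is not true of all real spark-submit flags.

```python
# Simplified model of the "[options] <app file>" pattern (NOT Spark's real parser).
# Assumption: every option is written as "--name value".

def split_submit_args(tokens):
    """Return (options, app_file, app_args) for a spark-submit style token list."""
    options = {}
    i = 0
    # Consume "--name value" pairs until the first token that is not an option.
    while i + 1 < len(tokens) and tokens[i].startswith("--"):
        options[tokens[i]] = tokens[i + 1]
        i += 2
    # The first non-option token is treated as the application file.
    app_file = tokens[i] if i < len(tokens) else None
    # Everything after the application file belongs to the application itself.
    app_args = tokens[i + 1:]
    return options, app_file, app_args


original = "atest.py --deploy-mode client --master yarn-client".split()
swapped = "--deploy-mode client --master yarn-client atest.py".split()

print(split_submit_args(original))
print(split_submit_args(swapped))
```

With the original ordering, the options dictionary comes back empty and all four tokens end up as arguments to atest.py. With the swapped ordering, both options are recognized and the script receives no arguments. If you want to confirm what a real run picked up, printing sc.master from inside the script should show the master the context actually used.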

answered Sep 18 '22 00:09

MrChristine
