After reading the document at http://spark.apache.org/docs/0.8.0/cluster-overview.html, I have some questions I'd like to clarify.
Take this example from Spark:
JavaSparkContext spark = new JavaSparkContext(
    new SparkConf().setJars("...").setSparkHome....);
JavaRDD<String> file = spark.textFile("hdfs://...");

// step1
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
    }
});

// step2
JavaPairRDD<String, Integer> pairs = words.map(new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
    }
});

// step3
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) {
        return a + b;
    }
});

counts.saveAsTextFile("hdfs://...");
So let's say I have a 3-node cluster, with node 1 running as the master, and the above driver program has been properly jarred (say, application-test.jar). Now I'm running this code on the master node, and I believe that right after the SparkContext is created, the application-test.jar file is copied to the worker nodes (and each worker creates a directory for that application).
So now my question: are step1, step2 and step3 in the example tasks that get sent over to the workers? If yes, how does the worker execute them? Something like java -cp "application-test.jar" step1, and so on?
A task is the smallest execution unit in Spark. A task executes a series of instructions; for example, reading data, filtering it and applying map() to it can be combined into a single task. Tasks are executed inside an executor.
A task in Spark is represented by the Task abstract class, which has two concrete implementations:
- ShuffleMapTask, which executes a task and divides the task's output into multiple buckets (based on the task's partitioner).
- ResultTask, which executes a task and sends the task's output back to the driver application.
Task - a single unit of work or execution that is sent to a Spark executor. Each stage is made up of Spark tasks (a unit of execution), which are then distributed across the Spark executors; each task maps to a single core and works on a single partition of data.
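The "one task per partition, one task per core" relationship above can be sketched in plain Java (this is a toy model for illustration, not Spark's actual scheduler; the class and method names are made up):

```java
import java.util.*;

// Toy model: a stage over N partitions produces N tasks, and each task
// runs on one executor core. Here we just assign tasks to cores round-robin.
public class StageSketch {
    // Returns a map from core index to the list of task indices it runs.
    static Map<Integer, List<Integer>> schedule(int numPartitions, int numCores) {
        Map<Integer, List<Integer>> coreToTasks = new TreeMap<>();
        for (int core = 0; core < numCores; core++) {
            coreToTasks.put(core, new ArrayList<>());
        }
        // One task per partition.
        for (int task = 0; task < numPartitions; task++) {
            coreToTasks.get(task % numCores).add(task);
        }
        return coreToTasks;
    }

    public static void main(String[] args) {
        // 6 partitions, 4 cores total (e.g. 2 executors x 2 cores each).
        System.out.println(schedule(6, 4)); // {0=[0, 4], 1=[1, 5], 2=[2], 3=[3]}
    }
}
```

With more partitions than cores, some cores process several tasks in turn, which is why Spark jobs often benefit from having more partitions than total cores.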
When you create the SparkContext, each worker starts an executor. This is a separate process (a JVM), and it loads your jar too. The executors connect back to your driver program. Now the driver can send them commands, like the flatMap, map and reduceByKey in your example. When the driver quits, the executors shut down.
RDDs are sort of like big arrays that are split into partitions, and each executor can hold some of these partitions.
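The "big array split into partitions" picture can be made concrete with a few lines of plain Java (no Spark needed; the partition method here is illustrative, not Spark's internal code):

```java
import java.util.*;

// Toy sketch of how an RDD's elements are split into partitions.
// Each executor would then hold some of these partitions.
public class PartitionSketch {
    // Split a list into numPartitions roughly equal contiguous slices.
    static <T> List<List<T>> partition(List<T> data, int numPartitions) {
        List<List<T>> parts = new ArrayList<>();
        int size = data.size();
        for (int i = 0; i < numPartitions; i++) {
            int from = i * size / numPartitions;
            int to = (i + 1) * size / numPartitions;
            parts.add(new ArrayList<>(data.subList(from, to)));
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        // Prints [[1, 2], [3, 4], [5, 6, 7]]
        System.out.println(partition(data, 3));
    }
}
```

A transformation like map then runs independently on each slice, which is what makes the work easy to spread across executors.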
A task is a command sent from the driver to an executor by serializing your Function object. The executor deserializes the command (this is possible because it has loaded your jar) and executes it on a partition.
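You can demonstrate this serialize/deserialize round trip with plain JDK serialization in a single JVM (a minimal sketch: Spark's real task shipping uses its own closure serializer and happens across the network, and the names below are invented for the example):

```java
import java.io.*;
import java.util.*;
import java.util.function.Function;

// Sketch of "sending a task": the driver serializes a function object;
// the executor deserializes it (its JVM has your jar on the classpath,
// so the class resolves) and applies it to one partition of data.
public class TaskSketch {
    // Like Spark's Function classes, the function must be Serializable.
    interface SerFunction<A, B> extends Function<A, B>, Serializable {}

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(o);
        }
        return bos.toByteArray();
    }

    @SuppressWarnings("unchecked")
    static <A, B> SerFunction<A, B> deserialize(byte[] bytes)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (SerFunction<A, B>) in.readObject();
        }
    }

    // Convenience: serialize then deserialize, wrapping checked exceptions.
    static <A, B> SerFunction<A, B> roundTrip(SerFunction<A, B> f) {
        try {
            return deserialize(serialize(f));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // "Driver" side: serialize the function, as if shipping it out.
        // "Executor" side: deserialize and apply it to a partition.
        SerFunction<String, Integer> task = roundTrip(s -> s.length());
        for (String s : Arrays.asList("spark", "task")) {
            System.out.println(s + " -> " + task.apply(s));
        }
    }
}
```

This is also why non-serializable state captured in your functions causes the familiar NotSerializableException at job submission time: the driver cannot turn the closure into bytes to send.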
(This is a conceptual overview. I am glossing over some details, but I hope it is helpful.)
To answer your specific question: no, a new process is not started for each step. A new process is started on each worker when the SparkContext is constructed.