
Which operations in Spark are processed in parallel?

I am trying to wrap my head around the whole concept of Spark. I think I have a very rudimentary understanding of the platform. From what I understand, Spark has the concept of RDDs, which are collections of "stuff" held in memory so that processing is faster. You transform RDDs by using methods like map and flatMap. Since transformations are lazy, they are not processed until you call an action on the final RDD. What I am unclear about is: when you do an action, are the transformations run in parallel? Can you assign workers to do the action in parallel?

For example, let's say I have a text file that I load into an RDD:

lines = sc.textFile("some_file.txt")  # load the file into an RDD
mapped = lines.map(SomeFunction)      # lazy transformation, nothing runs yet
mapped.count()                        # action: triggers the actual work

What is actually going on? Does SomeFunction() process a partition of the RDD? What is the parallel aspect?

Instinct asked Jun 29 '15


People also ask

What is Spark parallel processing?

Spark uses Resilient Distributed Datasets (RDDs) to perform parallel processing across a cluster, or across the cores of a single machine. It has easy-to-use APIs for operating on large datasets in various programming languages, including APIs for transforming data and familiar DataFrame APIs for manipulating semi-structured data.
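To make the idea concrete, here is a plain-Python sketch (not actual Spark code) of what a "partitioned collection processed in parallel" means: the dataset is split into partitions, and each worker applies the same per-row function to its own partition. The names some_function and process_partition are illustrative, not Spark APIs.

```python
from concurrent.futures import ThreadPoolExecutor

def some_function(row):
    # Per-row transformation, analogous to what map() applies.
    return row.upper()

def process_partition(partition):
    # Each worker applies the transformation to one whole partition.
    return [some_function(row) for row in partition]

data = ["a", "b", "c", "d", "e", "f"]
# Split the "dataset" into 3 partitions.
partitions = [data[i::3] for i in range(3)]

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(process_partition, partitions))

# Flatten the per-partition results back into one collection.
flat = [row for part in results for row in part]
print(sorted(flat))  # ['A', 'B', 'C', 'D', 'E', 'F']
```

In real Spark the partitions live on different machines and the function is serialised and shipped to them, but the shape of the computation is the same.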

Does Spark UDF run in parallel?

UDF is an abbreviation of "user-defined function" in Spark. Like Spark's native DataFrame functions, UDFs are applied to each partition in parallel across the cluster; however, ordinary Python UDFs process rows one at a time, whereas Spark-native functions (and pandas UDFs) are vectorized and therefore usually faster.

How the parallelism is achieved on Apache Spark platform?

One of the ways you can achieve parallelism in Python without using Spark DataFrames is the standard multiprocessing library. It provides a process-based abstraction that you can use to create concurrent workers. Note, however, that by default such code runs only on the driver node, not across the cluster.

What happens during parallel processing?

In parallel processing, we take in multiple different forms of information at the same time. This is especially important in vision. For example, when you see a bus coming towards you, you see its color, shape, depth, and motion all at once. If you had to assess those things one at a time, it would take far too long.


1 Answer

lines is just a name for the RDD, a data structure resident in the driver that represents a partitioned list of rows. The partitions themselves are managed on your worker nodes, when they are needed.

When your action count is called, Spark works backwards through the lineage of transformations to schedule the tasks needed to perform that action: a section of the file (a partition) is read, SomeFunction is serialised and sent over the network to the workers, and executed on each row. If you have several workers, more than one partition can be read at a time, and SomeFunction can be mapped over a partition for each worker/core.

Each worker sends the count of items for the partitions it has processed back to the driver, and the driver sums the counts from all the partitions to return the total.
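The count aggregation described above can be sketched in plain Python (again, not Spark code): each "worker" counts its own partition independently, and the "driver" just sums the partial counts.

```python
from concurrent.futures import ThreadPoolExecutor

def count_partition(partition):
    # Runs on a worker: SomeFunction would be applied here row by row,
    # but only the number of rows matters for count().
    return len(partition)

partitions = [["row1", "row2"], ["row3"], ["row4", "row5", "row6"]]

with ThreadPoolExecutor(max_workers=3) as pool:
    # The partitions are counted in parallel...
    partial_counts = list(pool.map(count_partition, partitions))

total = sum(partial_counts)   # ...and the driver sums the partial counts
print(partial_counts, total)  # [2, 1, 3] 6
```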

Note: in your example, SomeFunction is redundant with respect to the count of items.

Alister Lee answered Oct 11 '22