I am trying to wrap my head around the whole concept of Spark. I think I have a very rudimentary understanding of the platform. From what I understand, Spark has the concept of RDDs, which are collections of "stuff" held in memory so processing is faster. You transform RDDs by using methods like map and flatMap. Since transformations are lazy, they are not processed until you call an action on the final RDD. What I am unclear about is: when you do an action, are the transformations run in parallel? Can you assign workers to do the action in parallel?
For example, let's say I have a text file that I load into an RDD:
lines = //loadRDD
lines.map(SomeFunction())
lines.count()
What is actually going on? Does SomeFunction() process a partition of the RDD? What is the parallel aspect?
Spark uses Resilient Distributed Datasets (RDDs) to perform parallel processing across a cluster, or across the cores of a single machine. It has easy-to-use APIs for operating on large datasets in various programming languages: APIs for transforming data, and familiar DataFrame APIs for manipulating semi-structured data.
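As a rough illustration, here is a minimal PySpark sketch (the sample rows and column names are made up for this example) that builds an RDD spread over two partitions and then exposes the same data through the DataFrame API:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-example").getOrCreate()
sc = spark.sparkContext

# An RDD: a partitioned collection that Spark can process in parallel
rdd = sc.parallelize([("alice", 3), ("bob", 5)], numSlices=2)

# The same data through the DataFrame API
df = spark.createDataFrame(rdd, ["name", "score"])
df.show()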
UDF is an abbreviation of "user-defined function" in Spark. Generally, Spark-native functions applied to a Spark DataFrame are optimised by Spark and executed in parallel across partitions, whereas a Python UDF is called row by row, so native functions take better advantage of Spark's parallel processing.
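To make the contrast concrete, here is a hedged sketch (the DataFrame and the "shout" UDF are invented for illustration) showing a Spark-native function next to a Python UDF:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Spark-native function: optimised by Spark and run in parallel over partitions
df.select(F.upper(F.col("name"))).show()

# Python UDF: the function is serialised and called row by row on the workers
shout = F.udf(lambda s: s.upper() + "!", StringType())
df.select(shout(F.col("name"))).show()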
One way to achieve parallelism from a Spark driver program without using Spark DataFrames or RDDs is the Python multiprocessing library. It provides a process-based abstraction that you can use to run work concurrently. However, all of that code runs only on the driver node, so it does not scale out across the cluster.
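A small sketch of that contrast (the squaring task is just a placeholder): multiprocessing fans work out to extra processes on the driver machine, while an RDD spreads the same work across the cluster's executors.

import multiprocessing
from pyspark.sql import SparkSession

def square(x):
    return x * x

if __name__ == "__main__":
    # multiprocessing: extra processes, but all of them live on the driver node
    with multiprocessing.Pool(4) as pool:
        local_result = pool.map(square, range(10))

    # Spark: the same work split into 4 partitions and run on the executors
    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext
    distributed_result = sc.parallelize(range(10), 4).map(square).collect()

    print(local_result)
    print(distributed_result)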
lines is just a name for the RDD data structure resident in the driver, which represents a partitioned list of rows. The partitions are managed on each of your worker nodes when they are needed.

When your action count is called, Spark works backwards through the tasks needed to perform that action, resulting in a section of the file being read (a partition), SomeFunction being serialised and sent over the network to the workers, and executed on each row. If you have lots of workers then more than one partition can be read at a time, and SomeFunction can be mapped over a partition by each worker/core.
Each worker sends the count of items for the partitions it has processed back to the driver, and the driver sums the counts from all the partitions and returns the total.
Note: in your example, SomeFunction is redundant with respect to the count of items. In fact, since the result of lines.map(SomeFunction()) is never assigned to anything and count() is called on the original lines, the map transformation is never evaluated at all.
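To make that flow concrete, here is a minimal PySpark sketch; the file path, partition count, and the body of some_function are placeholders, and unlike the original snippet the mapped RDD is assigned and counted so that the function actually runs:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Read the file as an RDD split into (at least) 4 partitions
lines = sc.textFile("/path/to/file.txt", minPartitions=4)
print(lines.getNumPartitions())

def some_function(line):
    # stand-in for SomeFunction: runs on the workers, once per row
    return line.strip()

# Lazy transformation: nothing runs yet
mapped = lines.map(some_function)

# count() triggers the job: each worker counts the rows in its partitions
# and the driver sums the per-partition counts
print(mapped.count())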