I am trying to wrap my head around the whole concept of Spark. I think I have a very rudimentary understanding of the platform. From what I understand, Spark has the concept of RDDs, which are collections of "stuff" held in memory so processing is faster. You transform RDDs by using methods like map and flatMap. Since transformations are lazy, they are not processed until you call an action on the final RDD. What I am unclear about is: when you do an action, are the transformations run in parallel? Can you assign workers to do the action in parallel?
For example, let's say I have a text file that I load into an RDD:
lines = //loadRDD
lines.map(SomeFunction())
lines.count()
What is actually going on? Does SomeFunction() process a partition of the RDD? What is the parallel aspect?
Spark uses Resilient Distributed Datasets (RDDs) to perform parallel processing across a cluster, or across the cores of a single machine. It has easy-to-use APIs for operating on large datasets in various programming languages: APIs for transforming data, and familiar DataFrame APIs for manipulating semi-structured data.
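As a rough illustration, here is a minimal PySpark sketch (the sample rows and column names are made up for this example) that builds an RDD spread over two partitions and then exposes the same data through the DataFrame API:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-example").getOrCreate()
sc = spark.sparkContext

# An RDD: a partitioned collection that Spark can process in parallel
rdd = sc.parallelize([("alice", 3), ("bob", 5)], numSlices=2)

# The same data through the DataFrame API
df = spark.createDataFrame(rdd, ["name", "score"])
df.show()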
UDF is an abbreviation of "user-defined function" in Spark. Generally, Spark-native functions applied to a Spark DataFrame are optimised by Spark and executed in parallel across partitions, whereas a Python UDF is called row by row, so native functions take better advantage of Spark's parallel processing.
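To make the contrast concrete, here is a hedged sketch (the DataFrame and the "shout" UDF are invented for illustration) showing a Spark-native function next to a Python UDF:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Spark-native function: optimised by Spark and run in parallel over partitions
df.select(F.upper(F.col("name"))).show()

# Python UDF: the function is serialised and called row by row on the workers
shout = F.udf(lambda s: s.upper() + "!", StringType())
df.select(shout(F.col("name"))).show()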
One way to achieve parallelism from a Spark driver program without using Spark DataFrames or RDDs is the Python multiprocessing library. It provides a process-based abstraction that you can use to run work concurrently. However, all of that code runs only on the driver node, so it does not scale out across the cluster.
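A small sketch of that contrast (the squaring task is just a placeholder): multiprocessing fans work out to extra processes on the driver machine, while an RDD spreads the same work across the cluster's executors.

import multiprocessing
from pyspark.sql import SparkSession

def square(x):
    return x * x

if __name__ == "__main__":
    # multiprocessing: extra processes, but all of them live on the driver node
    with multiprocessing.Pool(4) as pool:
        local_result = pool.map(square, range(10))

    # Spark: the same work split into 4 partitions and run on the executors
    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext
    distributed_result = sc.parallelize(range(10), 4).map(square).collect()

    print(local_result)
    print(distributed_result)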
lines is just a name for the RDD data structure resident in the driver, which represents a partitioned list of rows. The partitions are managed on each of your worker nodes when they are needed.

When your action count is called, Spark works backwards through the tasks needed to perform that action, resulting in a section of the file being read (a partition), SomeFunction being serialised and sent over the network to the workers, and executed on each row. If you have lots of workers then more than one partition can be read at a time, and SomeFunction can be mapped over a partition by each worker/core.
Each worker sends the count of items for the partitions it has processed back to the driver, and the driver sums the counts from all the partitions and returns the total.
Note: in your example, SomeFunction is redundant with respect to the count of items. In fact, since the result of lines.map(SomeFunction()) is never assigned to anything and count() is called on the original lines, the map transformation is never evaluated at all.
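To make that flow concrete, here is a minimal PySpark sketch; the file path, partition count, and the body of some_function are placeholders, and unlike the original snippet the mapped RDD is assigned and counted so that the function actually runs:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Read the file as an RDD split into (at least) 4 partitions
lines = sc.textFile("/path/to/file.txt", minPartitions=4)
print(lines.getNumPartitions())

def some_function(line):
    # stand-in for SomeFunction: runs on the workers, once per row
    return line.strip()

# Lazy transformation: nothing runs yet
mapped = lines.map(some_function)

# count() triggers the job: each worker counts the rows in its partitions
# and the driver sums the per-partition counts
print(mapped.count())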