I have a silly question involving `fold` and `reduce` in PySpark. I understand the difference between these two methods, but if both require that the applied function be a commutative monoid, I cannot figure out an example in which `fold` cannot be substituted by `reduce`.
Besides, in the PySpark implementation of `fold` it uses `acc = op(obj, acc)`. Why is this operation order used instead of `acc = op(acc, obj)`? (This second order sounds closer to a `foldLeft` to me.)
Cheers
Tomas
In order for Spark to become a leader in computational speed, it needed to incorporate operational parallelism. Parallelism is ultimately the reason `foldLeft` is not found on the RDD class: `foldLeft` processes elements strictly from left to right, one at a time, which cannot be split across partitions.
`fold` calls `fold` on the iterator of each partition and then merges the results; `reduce` calls `reduceLeft` on the iterator of each partition and then merges the results. The difference is that `fold` doesn't need to worry about empty partitions or collections, because it can fall back on the zero value.
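To make the partition-level semantics concrete, here is a minimal local sketch in plain Python (not Spark's actual implementation; the `simulated_fold` / `simulated_reduce` helpers and the list-of-lists partition layout are illustrative assumptions):

from functools import reduce

# Simulate fold: fold each partition starting from the zero value,
# then fold the per-partition results, again starting from zero.
def simulated_fold(partitions, zero, op):
    per_partition = [reduce(op, part, zero) for part in partitions]
    return reduce(op, per_partition, zero)

# Simulate reduce: reduceLeft each non-empty partition, then merge;
# with no non-empty partition there is no first element to start from.
def simulated_reduce(partitions, op):
    per_partition = [reduce(op, part) for part in partitions if part]
    if not per_partition:
        raise ValueError("empty collection")
    return reduce(op, per_partition)

print(simulated_fold([[1, 2], [3, 4], []], 0, lambda a, b: a + b))   # 10
print(simulated_reduce([[1, 2], [3, 4], []], lambda a, b: a + b))    # 10

Note that the zero value is applied once per partition and once more in the merge step, which is why it must be a neutral element for the operation.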
`reduce` is an action that aggregates the elements of the dataset using a function `func` (which takes two arguments and returns one); it can also be used on a single RDD (see the Spark documentation for more info).
Actions are RDD operations that return a value back to the Spark driver program, kicking off a job to execute on the cluster. A transformation's output is the input of actions. `reduce`, `collect`, `takeSample`, `take`, `first`, `saveAsTextFile`, `saveAsSequenceFile`, `countByKey`, and `foreach` are common actions in Apache Spark.
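For example, a minimal PySpark session (assuming a local SparkContext):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize([1, 2, 3, 4, 5])

# reduce is an action: it kicks off a job and returns a plain value
# to the driver instead of producing another RDD.
print(rdd.reduce(lambda a, b: a + b))  # 15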
Empty RDD
`fold` cannot be substituted by `reduce` when the RDD is empty:
val rdd = sc.emptyRDD[Int]
rdd.reduce(_ + _)
// java.lang.UnsupportedOperationException: empty collection at
// org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$ ...
rdd.fold(0)(_ + _)
// Int = 0
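The same contrast can be reproduced in PySpark (a sketch; here `reduce` fails with a Python-side ValueError rather than the JVM exception above):

rdd = sc.parallelize([])

# rdd.reduce(lambda a, b: a + b)
# ValueError: Can not reduce() empty RDD

rdd.fold(0, lambda a, b: a + b)
# 0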
You can of course combine `reduce` with a check on `isEmpty`, but it is rather ugly.
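For completeness, that workaround would look something like this (illustrative sketch):

# Works, but the zero value has to be bolted on from the outside:
total = 0 if rdd.isEmpty() else rdd.reduce(lambda a, b: a + b)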
Mutable buffer
Another use case for `fold` is aggregation with a mutable buffer. Consider the following RDD:
import breeze.linalg.DenseVector
val rdd = sc.parallelize(Array.fill(100)(DenseVector(1)), 8)
Let's say we want the sum of all elements. A naive solution is to simply reduce with `+`:
rdd.reduce(_ + _)
Unfortunately, it creates a new vector for each element. Since object creation and subsequent garbage collection are expensive, it may be better to use a mutable object. That is not possible with `reduce` (immutability of an RDD doesn't imply immutability of its elements), but it can be achieved with `fold` as follows:
rdd.fold(DenseVector(0))((acc, x) => acc += x)
The zero element is used here as a mutable buffer, initialized once per partition, leaving the actual data untouched.
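A rough PySpark analogue of the same trick, assuming NumPy arrays as the mutable buffer and a Spark version where the accumulator is passed as the first argument (see the argument-order discussion below):

import numpy as np

rdd = sc.parallelize([np.array([1]) for _ in range(100)], 8)

def add_inplace(acc, x):
    # Accumulate into the buffer in place instead of allocating a new
    # array per element; each task works on its own copy of the zero value.
    acc += x
    return acc

rdd.fold(np.zeros(1), add_inplace)
# array([100.])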
As for why `acc = op(obj, acc)` is used instead of `acc = op(acc, obj)`: see SPARK-6416 and SPARK-7683.
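For reference, the per-partition loop the question quotes looks roughly like this (paraphrased from PySpark's rdd.py; the exact argument order is what those tickets discuss, and it has changed across versions):

# Inside RDD.fold, applied to each partition's iterator:
def func(iterator):
    acc = zeroValue
    for obj in iterator:
        acc = op(obj, acc)  # element first, accumulator second
    yield acc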