Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

foldLeft or foldRight equivalent in Spark?

In Spark's RDDs and DStreams we have the 'reduce' function for transforming an entire RDD into one element. However the reduce function takes (T,T) => T However if we want to reduce a List in Scala we can use foldLeft or foldRight which takes type (B)( (B,A) => B) This is very useful because you start folding with a type other then what is in your list.

Is there a way in Spark to do something similar? Where I can start with a value that is of different type then the elements in the RDD itself

like image 769
zunior Avatar asked Aug 04 '15 15:08

zunior


People also ask

What is foldLeft in spark?

The Scala foldLeft method can be used to iterate over a data structure and perform multiple operations on a Spark DataFrame. foldLeft can be used to eliminate all whitespace in multiple columns or convert all the column names in a DataFrame to snake_case.

What is foldLeft and foldRight in Scala?

The primary difference is the order in which the fold operation iterates through the collection in question. foldLeft starts on the left side—the first item—and iterates to the right; foldRight starts on the right side—the last item—and iterates to the left. fold goes in no particular order.

Why fold left and fold right are not supported in spark?

In order for Spark to become a leader in computational speed, it needed to incorporate operational parallelism. Parallelism will ultimately be the reason foldLeft is not found on the RDD class.

What is foldLeft in Scala?

foldLeft() method is a member of TraversableOnce trait, it is used to collapse elements of collections. It navigates elements from Left to Right order. It is primarily used in recursive functions and prevents stack overflow exceptions.


1 Answers

Use aggregate instead of reduce. It allows you also to specify a "zero" value of type B and a function like the one you want: (B,A) => B. Do note that you also need to merge separate aggregations done on separate executors, so a (B, B) => B function is also required.

Alternatively, if you want this aggregation as a side effect, an option is to use an accumulator. In particular, the accumulable type allows for the result type to be of a different type than the accumulating type.

Also, if you even need to do the same with a key-value RDD, use aggregateByKey.

like image 160
Daniel Langdon Avatar answered Sep 24 '22 06:09

Daniel Langdon