Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does a flatMap in spark cause a shuffle?

Does flatMap in spark behave like the map function and therefore cause no shuffling, or does it trigger a shuffle. I suspect it does cause shuffling. Can someone confirm it?

like image 759
pythonic Avatar asked Apr 04 '16 22:04

pythonic


People also ask

What triggers a shuffle in spark?

Transformations which can cause a shuffle include repartition operations like repartition and coalesce , 'ByKey operations (except for counting) like groupByKey and reduceByKey , and join operations like cogroup and join .

How does flatMap work spark?

A flatMap is a transformation operation. It applies to each element of RDD and it returns the result as new RDD. It is similar to Map, but FlatMap allows returning 0, 1 or more elements from map function. In the FlatMap operation, a developer can define his own custom business logic.

How You Can Avoid shuffle in spark?

Joining two tables is one of the main transactions in Spark. It mostly requires shuffle which has a high cost due to data movement between nodes. If one of the tables is small enough, any shuffle operation may not be required. By broadcasting the small table to each node in the cluster, shuffle can be simply avoided.

Does distinct cause shuffle?

The groupByKey(), reduceByKey(), join(), and distinct() are some examples of wide transformations that can cause a shuffle.


1 Answers

There is no shuffling with either map or flatMap. The operations that cause shuffle are:

  • Repartition operations:
    • Repartition:
    • Coalesce:
  • ByKey operations (except for counting):
    • GroupByKey:
    • ReduceByKey:
  • Join operations:
    • Cogroup:
    • Join:

Although the set of elements in each partition of newly shuffled data will be deterministic, and so is the ordering of partitions themselves, the ordering of these elements is not. If one desires predictably ordered data following shuffle then it’s possible to use:

  • mapPartitions to sort each partition using, for example, .sorted
  • repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning
  • sortBy to make a globally ordered RDD

More info here: http://spark.apache.org/docs/latest/programming-guide.html#shuffle-operations

like image 128
JorgeGlezLopez Avatar answered Sep 20 '22 22:09

JorgeGlezLopez