
Operations and methods to be careful about in Apache Spark?

What operations and/or methods do I need to be careful about in Apache Spark? I've heard you should be careful about:

  1. groupByKey
  2. collectAsMap

Why?

Are there other methods?

Josh Unger asked Dec 15 '22 17:12



1 Answer

Spark has what you could call 'expensive' operations: all those that require a shuffle (data reorganization) fall into this category. The presence of a ShuffledRDD in the output of rdd.toDebugString gives those away.
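As a minimal sketch of that check, assuming a local Spark installation on the classpath: the object name ShuffleCheck, the helper requiresShuffle and the sample data are made up for illustration. It only looks for the string "ShuffledRDD" in the lineage, so treat it as a quick heuristic rather than a complete shuffle detector.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark.
object ShuffleCheck {
  // Heuristic: a ShuffledRDD in the printed lineage means a shuffle stage.
  def requiresShuffle(rdd: RDD[_]): Boolean =
    rdd.toDebugString.contains("ShuffledRDD")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("shuffle-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

      // mapValues works partition by partition: no data reorganization.
      val mapped = pairs.mapValues(_ + 1)
      // reduceByKey has to bring equal keys together: this needs a shuffle.
      val reduced = pairs.reduceByKey(_ + _)

      println(mapped.toDebugString)
      println(reduced.toDebugString)
      println(s"mapped needs shuffle:  ${requiresShuffle(mapped)}")
      println(s"reduced needs shuffle: ${requiresShuffle(reduced)}")
    } finally {
      sc.stop()
    }
  }
}
```

Printing the lineage before running an action is cheap, since toDebugString does not trigger the computation.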

If you mean "careful" as "with the potential of causing problems", some operations in Spark will cause memory-related issues when used without care:

  • groupByKey requires all values for a given key to fit in memory on a single executor. This means that large datasets grouped by low-cardinality keys have the potential to crash the job. (think allTweets.keyBy(_.date.dayOfTheWeek).groupByKey -> bumm)
    • favor the use of aggregateByKey or reduceByKey to apply map-side reduction before collecting values for a key.
  • collect materializes the RDD (forces computation) and sends all the data to the driver. (think allTweets.collect -> bumm)
    • If you want to trigger the computation of an rdd, favor the use of rdd.count
    • To check the data of your rdd, use bounded operations like rdd.first (first element) or rdd.take(n) for n elements
    • If you really need to do collect, use rdd.filter or rdd.reduce to reduce its cardinality
  • collectAsMap is just collect behind the scenes
  • cartesian: creates the product of one RDD with another, potentially creating a very large RDD. oneKRdd.cartesian(oneKRdd).count = 1000000
    • consider adding keys and using join to combine two RDDs.
  • others?
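The groupByKey and collect advice above can be sketched roughly as follows. This is an illustration under assumptions, not a definitive recipe: the object SaferAlternatives, its two helpers and the tiny (day, text) dataset are all hypothetical, and it assumes Spark running in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helpers, named for this example only.
object SaferAlternatives {
  // Count records per key. reduceByKey combines partial counts on the map
  // side, so no executor has to hold all values of one key at once.
  def countPerKey(events: RDD[(String, String)]): RDD[(String, Long)] =
    events.mapValues(_ => 1L).reduceByKey(_ + _)

  // Average text length per key with aggregateByKey: the accumulator is a
  // small (sum, count) pair instead of the full list of values.
  def avgLengthPerKey(events: RDD[(String, String)]): RDD[(String, Double)] =
    events
      .aggregateByKey((0L, 0L))(
        (acc, text) => (acc._1 + text.length, acc._2 + 1L),
        (a, b) => (a._1 + b._1, a._2 + b._2))
      .mapValues { case (sum, n) => sum.toDouble / n }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("safer-alternatives").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for a large dataset keyed by a low-cardinality field.
      val tweets = sc.parallelize(Seq(
        ("Mon", "hello"), ("Mon", "hi"), ("Tue", "spark")))

      // Trigger the computation without moving the data to the driver.
      println(s"total records: ${tweets.count()}")

      // Inspect a bounded sample instead of collecting everything.
      tweets.take(2).foreach(println)

      // After aggregation there is one record per key, so collect is cheap.
      countPerKey(tweets).collect().foreach(println)
      avgLengthPerKey(tweets).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The collect calls at the end are acceptable only because the aggregation has already reduced the data to one record per key; on the raw tweets RDD the same call would ship every record to the driver.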

In general, having an idea of the volume of data flowing through the stages of a Spark job, and of what each operation will do with it, will help you stay sane.

maasg answered Feb 01 '23 23:02
