What operations and/or methods do I need to be careful about in Apache Spark? I've heard you should be careful about:
- groupByKey
- collectAsMap

Why?
Are there other methods?
There are operations in Spark you could call 'expensive': all those that require a shuffle (data reorganization) fall into this category. Checking for the presence of ShuffledRDD in the output of rdd.toDebugString gives those away.
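For example, here is a minimal sketch of reading the lineage to spot a shuffle boundary; the local SparkContext setup and the toy word-count data are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-check").setMaster("local[*]"))

    val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // reduceByKey introduces a shuffle

    // The lineage prints one line per RDD; the shuffle boundary shows up as a ShuffledRDD,
    // roughly like:
    //   (n) ShuffledRDD[2] at reduceByKey ...
    //    +-(n) MapPartitionsRDD[1] at map ...
    //       |  ParallelCollectionRDD[0] at parallelize ...
    println(counts.toDebugString)

    sc.stop()
  }
}
```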
If by "careful" you mean "with the potential of causing problems", then yes: some operations in Spark will cause memory-related issues when used without care:
- groupByKey requires all the values falling under one key to fit in memory on a single executor. This means that large datasets grouped by low-cardinality keys have the potential to crash the job (think allTweets.keyBy(_.date.dayOfTheWeek).groupByKey -> boom). Prefer aggregateByKey or reduceByKey, which apply map-side reduction before the values for a key are collected (see the first sketch after this list).
- collect materializes the RDD (forces computation) and sends all the data to the driver (think allTweets.collect -> boom). If you only need a count or a few elements, prefer rdd.count, rdd.first (first element) or rdd.take(n) for n elements; instead of a full collect, use rdd.filter or rdd.reduce to shrink the RDD's cardinality first (see the second sketch after this list).
- collectAsMap is just collect behind the scenes, so the same caveats apply.
- cartesian creates the product of one RDD with another, potentially producing a very large RDD: oneKRdd.cartesian(oneKRdd).count = 1000000. Consider join to combine 2 RDDs instead (see the last sketch after this list).
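A minimal sketch of the groupByKey point, contrasting it with reduceByKey and aggregateByKey; the tweetsByDay pair RDD, its contents, and the aggregation logic are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GroupVsReduce {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("group-vs-reduce").setMaster("local[*]"))

    // Hypothetical stand-in for allTweets keyed by day: (dayOfWeek, tweetText) pairs.
    val tweetsByDay = sc.parallelize(Seq(
      ("Mon", "tweet1"), ("Mon", "tweet2"), ("Tue", "tweet3"), ("Mon", "tweet4")
    ))

    // Risky pattern (never materialized here): every tweet for a given day
    // would have to fit in one executor's memory.
    val grouped = tweetsByDay.groupByKey()                   // RDD[(String, Iterable[String])]

    // Safer when you only need an aggregate: values are combined map-side
    // before being shuffled, so far less data crosses the network.
    val countsPerDay = tweetsByDay.mapValues(_ => 1L).reduceByKey(_ + _)

    // Same idea with aggregateByKey: zero value + within-partition + across-partition combiners.
    val lengthsPerDay = tweetsByDay.aggregateByKey(0L)(
      (acc, tweet) => acc + tweet.length,  // merge a value into the per-partition accumulator
      (a, b) => a + b                      // merge accumulators across partitions
    )

    countsPerDay.collect().foreach(println)
    lengthsPerDay.collect().foreach(println)
    sc.stop()
  }
}
```

Because the per-partition combining happens before the shuffle, no single executor ever has to hold a whole day's worth of tweets.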
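In the same spirit, a sketch of lighter-weight alternatives to a full collect; allTweets below is a hypothetical large RDD built from synthetic strings:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverSafety {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("driver-safety").setMaster("local[*]"))

    val allTweets = sc.parallelize(1 to 1000000).map(i => s"tweet-$i")  // stand-in for a big RDD

    // Risky: ships the entire RDD into the driver's memory.
    // val everything = allTweets.collect()

    // Cheaper ways to inspect or summarize it:
    val howMany = allTweets.count()     // computes the RDD, but only a single Long comes back
    val sample  = allTweets.take(10)    // only 10 elements come back; computes just enough partitions
    val first   = allTweets.first()     // just one element

    // If you really need to collect, shrink the RDD first.
    val matches = allTweets.filter(_.contains("999")).collect()

    println(s"count=$howMany, sample=${sample.mkString(",")}, first=$first, matches=${matches.length}")
    sc.stop()
  }
}
```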
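Finally, a sketch of the cartesian versus join point; the keyed ids/balances RDDs are invented for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CartesianVsJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cartesian-vs-join").setMaster("local[*]"))

    val oneKRdd = sc.parallelize(1 to 1000)

    // Quadratic blow-up: 1000 x 1000 = 1,000,000 pairs.
    println(oneKRdd.cartesian(oneKRdd).count())        // 1000000

    // If the two RDDs share a key, join only pairs up matching keys.
    val ids      = oneKRdd.map(i => (i, s"name-$i"))   // (id, name)
    val balances = oneKRdd.map(i => (i, i * 10.0))     // (id, balance)
    println(ids.join(balances).count())                // 1000 matched pairs, not 1,000,000

    sc.stop()
  }
}
```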
In general, having an idea of the volume of data flowing through the stages of a Spark job, and of what each operation does with it, will help you stay mentally sane.