What operations and/or methods do I need to be careful about in Apache Spark? I've heard you should be careful about:
- groupByKey
- collectAsMap
Why?
Are there other methods?
There are what you could call 'expensive' operations in Spark: all those that require a shuffle (a reorganization of data across the cluster) fall into this category. Checking for the presence of ShuffledRDD in the output of rdd.toDebugString gives them away.
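For example, in a spark-shell (a minimal sketch; `sc` is the SparkContext provided by the shell, and the exact lineage output varies by Spark version):

```scala
// reduceByKey forces a shuffle, which shows up as a ShuffledRDD
// in the lineage printed by toDebugString.
val words = sc.parallelize(Seq("a", "b", "a", "c"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
println(counts.toDebugString)
// (8) ShuffledRDD[2] at reduceByKey at <console>:26 []
//  +-(8) MapPartitionsRDD[1] at map at <console>:25 []
//     |  ParallelCollectionRDD[0] at parallelize at <console>:24 []
```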
If by "careful" you mean "with the potential of causing problems", then some operations in Spark will cause memory-related issues when used without care:

- groupByKey requires that all values falling under one key fit in memory on a single executor. This means that a large dataset grouped by a low-cardinality key can crash the job (think allTweets.keyBy(_.date.dayOfTheWeek).groupByKey -> boom). Prefer aggregateByKey or reduceByKey, which apply map-side reduction before collecting the values for a key (see the first sketch at the end of this answer).
- collect materializes the RDD (forces computation) and sends all the data to the driver (think allTweets.collect -> boom). If you only want to materialize the RDD, use rdd.count, rdd.first (first element) or rdd.take(n) (n elements) instead; if you really need the data on the driver, apply rdd.filter or rdd.reduce first to reduce its cardinality (see the second sketch at the end).
- collectAsMap is just collect behind the scenes, so the same caution applies.
- cartesian creates the product of one RDD with another, potentially producing a very large RDD: oneKRdd.cartesian(oneKRdd).count == 1000000. Consider using join to combine two RDDs instead (see the last sketch at the end).

In general, having an idea of the volume of data flowing through the stages of a Spark job, and of what each operation does with it, will help you keep mentally sane.
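As a sketch of the map-side reduction alternative mentioned above (it assumes a spark-shell `sc`, and the (day, tweet) pairs are made up for illustration):

```scala
// Made-up (dayOfWeek, tweetText) pairs standing in for allTweets.
val tweets = sc.parallelize(Seq(("Mon", "t1"), ("Mon", "t2"), ("Tue", "t3")))

// Risky with low-cardinality keys: every tweet for a given day is
// pulled into memory on one executor before anything can be done with it.
val perDayGrouped = tweets.groupByKey().mapValues(_.size)

// Safer: reduceByKey combines partial counts map-side before the shuffle,
// so only one small number per key and partition crosses the network.
val perDayCounted = tweets.mapValues(_ => 1).reduceByKey(_ + _)

perDayCounted.collect().foreach(println)  // the result is tiny, so collect is fine here
```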
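For the collect point, a sketch of keeping data away from the driver (again assuming a spark-shell `sc`; the contents of allTweets are invented):

```scala
// Stand-in for a large RDD of tweet texts.
val allTweets = sc.parallelize(Seq("spark is fun", "hello world", "spark shuffles"))

// allTweets.collect()                     // would ship the whole dataset to the driver
allTweets.count()                          // materializes the RDD but returns only a Long
allTweets.take(5)                          // at most 5 elements reach the driver
allTweets.filter(_.contains("spark")).collect()  // reduce cardinality first, then collect
```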
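And a last sketch contrasting cartesian with join (the two 1000-element keyed RDDs are made up, echoing the oneKRdd example above):

```scala
// Two made-up keyed RDDs of 1000 elements each.
val oneKRdd  = sc.parallelize(1 to 1000).map(i => (i, s"left-$i"))
val otherRdd = sc.parallelize(1 to 1000).map(i => (i, s"right-$i"))

// cartesian pairs every record with every other record: 1,000,000 rows.
// oneKRdd.cartesian(otherRdd).count()   // == 1000000

// join only combines records that share a key: 1,000 rows.
oneKRdd.join(otherRdd).count()           // == 1000
```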