What operations and/or methods do I need to be careful about in Apache Spark? I've heard you should be careful about:
- groupByKey
- collectAsMap
Why?
Are there other methods?
There are what you could call 'expensive' operations in Spark: all those that require a shuffle (a reorganization of data across the cluster) fall into this category. Checking for the presence of ShuffledRDD in the output of rdd.toDebugString gives them away.
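For example, in a spark-shell (a minimal sketch; `sc` is the SparkContext provided by the shell, and the exact lineage output varies by Spark version):

```scala
// reduceByKey forces a shuffle, which shows up as a ShuffledRDD
// in the lineage printed by toDebugString.
val words = sc.parallelize(Seq("a", "b", "a", "c"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
println(counts.toDebugString)
// (8) ShuffledRDD[2] at reduceByKey at <console>:26 []
//  +-(8) MapPartitionsRDD[1] at map at <console>:25 []
//     |  ParallelCollectionRDD[0] at parallelize at <console>:24 []
```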
If by "careful" you mean "with the potential of causing problems", then some operations in Spark will cause memory-related issues when used without care:

- groupByKey requires that all values falling under one key fit in memory on a single executor. This means that a large dataset grouped by a low-cardinality key can crash the job (think allTweets.keyBy(_.date.dayOfTheWeek).groupByKey -> boom). Prefer aggregateByKey or reduceByKey, which apply map-side reduction before collecting the values for a key (see the first sketch at the end of this answer).
- collect materializes the RDD (forces computation) and sends all the data to the driver (think allTweets.collect -> boom). If you only want to materialize the RDD, use rdd.count, rdd.first (first element) or rdd.take(n) (n elements) instead; if you really need the data on the driver, apply rdd.filter or rdd.reduce first to reduce its cardinality (see the second sketch at the end).
- collectAsMap is just collect behind the scenes, so the same caution applies.
- cartesian creates the product of one RDD with another, potentially producing a very large RDD: oneKRdd.cartesian(oneKRdd).count == 1000000. Consider using join to combine two RDDs instead (see the last sketch at the end).

In general, having an idea of the volume of data flowing through the stages of a Spark job, and of what each operation does with it, will help you keep mentally sane.
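As a sketch of the map-side reduction alternative mentioned above (it assumes a spark-shell `sc`, and the (day, tweet) pairs are made up for illustration):

```scala
// Made-up (dayOfWeek, tweetText) pairs standing in for allTweets.
val tweets = sc.parallelize(Seq(("Mon", "t1"), ("Mon", "t2"), ("Tue", "t3")))

// Risky with low-cardinality keys: every tweet for a given day is
// pulled into memory on one executor before anything can be done with it.
val perDayGrouped = tweets.groupByKey().mapValues(_.size)

// Safer: reduceByKey combines partial counts map-side before the shuffle,
// so only one small number per key and partition crosses the network.
val perDayCounted = tweets.mapValues(_ => 1).reduceByKey(_ + _)

perDayCounted.collect().foreach(println)  // the result is tiny, so collect is fine here
```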
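For the collect point, a sketch of keeping data away from the driver (again assuming a spark-shell `sc`; the contents of allTweets are invented):

```scala
// Stand-in for a large RDD of tweet texts.
val allTweets = sc.parallelize(Seq("spark is fun", "hello world", "spark shuffles"))

// allTweets.collect()                     // would ship the whole dataset to the driver
allTweets.count()                          // materializes the RDD but returns only a Long
allTweets.take(5)                          // at most 5 elements reach the driver
allTweets.filter(_.contains("spark")).collect()  // reduce cardinality first, then collect
```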
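And a last sketch contrasting cartesian with join (the two 1000-element keyed RDDs are made up, echoing the oneKRdd example above):

```scala
// Two made-up keyed RDDs of 1000 elements each.
val oneKRdd  = sc.parallelize(1 to 1000).map(i => (i, s"left-$i"))
val otherRdd = sc.parallelize(1 to 1000).map(i => (i, s"right-$i"))

// cartesian pairs every record with every other record: 1,000,000 rows.
// oneKRdd.cartesian(otherRdd).count()   // == 1000000

// join only combines records that share a key: 1,000 rows.
oneKRdd.join(otherRdd).count()           // == 1000
```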