Spark collect_set vs distinct

Question

If my goal is to collect the distinct values in a column as a list, is there a performance difference, or are there other pros and cons, between these two approaches?

df.select(column).distinct().collect()...

vs

df.select(collect_set(column)).first()...

Samir Vyas · Accepted Answer

collect_set is an aggregate function and is normally used after a groupBy. When no grouping is provided, it treats the entire dataset as one big group.
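To make the grouping behaviour concrete, here is a small plain-Python model of it. This is only a sketch and not Spark code: the function collect_set_model, the sample rows, and the "__all__" key are made up for illustration.

```python
from collections import defaultdict


def collect_set_model(rows, value_col, group_col=None):
    """Toy model of collect_set: one set of values per group.

    With group_col=None every row falls into the same single group,
    which mirrors using the aggregate without a groupBy.
    """
    groups = defaultdict(set)
    for row in rows:
        # No grouping column: use one shared key so all rows land in one group.
        key = row[group_col] if group_col is not None else "__all__"
        groups[key].add(row[value_col])
    return dict(groups)


# Hypothetical sample data.
rows = [
    {"country": "FR", "city": "Paris"},
    {"country": "FR", "city": "Lyon"},
    {"country": "FR", "city": "Paris"},
    {"country": "DE", "city": "Berlin"},
]

grouped = collect_set_model(rows, "city", group_col="country")
ungrouped = collect_set_model(rows, "city")

print(grouped)    # one set per country
print(ungrouped)  # a single group holding every distinct city
```

In the ungrouped case the whole result is one set under one key, which is why the final result has to fit in one place.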

1. collect_set

df.select(collect_set(column)).first()...

This funnels the values of column column into a single task on one node, which performs the collect_set operation (removing duplicates) and produces one row holding the whole set. If your data is large, this can overwhelm the single executor where all the data ends up.

2. distinct

df.select(column).distinct().collect()...

This partitions the data of column column by its value (the partition key). The number of partitions is the value of spark.sql.shuffle.partitions (say 200, the default). So 200 tasks run to remove duplicates, one per partition, and each partition may hold many distinct values. Only the deduplicated data is then sent to the driver for the .collect() operation. This will still fail if the data remaining after deduplication is huge, because the driver will run out of memory.
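The partition-then-deduplicate idea can be sketched in plain Python. This is an illustrative model only: distinct_model is a hypothetical helper, Python's built-in hash stands in for Spark's own partitioner, and 4 partitions are used instead of 200 to keep the output readable.

```python
def distinct_model(values, num_partitions=4):
    """Toy model of a shuffle-based distinct.

    Each value is routed to a partition by hashing it, so equal values
    always meet in the same partition and can be deduplicated locally.
    """
    partitions = [set() for _ in range(num_partitions)]
    for v in values:
        # Stand-in for the shuffle: same value -> same partition.
        partitions[hash(v) % num_partitions].add(v)
    # Stand-in for collect(): the driver receives only deduplicated values.
    collected = [v for part in partitions for v in part]
    return partitions, collected


# Hypothetical column values with duplicates (ints hash deterministically).
values = [1, 2, 3, 2, 1, 10, 11, 10, 7, 7, 7]
partitions, collected = distinct_model(values, num_partitions=4)

for i, part in enumerate(partitions):
    print(i, sorted(part))
print(sorted(collected))
```

Note that a partition can hold several distinct values (here 2 and 10 share one), so no single task has to hold the whole set. The collected list is still the full set of distinct values, so the driver-memory limit remains.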

TLDR:

distinct() is better than collect_set for your specific need.


Tags:

apache-spark

apache-spark-sql

Arash
