
Spark collect_set vs distinct

If my goal is to collect the distinct values in a column as a list, is there a performance difference, or any pros/cons, between these two approaches?

df.select(column).distinct().collect()...

vs

df.select(collect_set(column)).first()...
asked Oct 26 '25 by Arash

1 Answer

collect_set is an aggregate function and is normally used after a groupBy. When no grouping is provided, it treats the entire dataset as one big group.
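For illustration, here is a minimal sketch (the DataFrame and column names are made up for this example) showing collect_set both with and without an explicit grouping:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_set

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame with duplicate values in "city"
df = spark.createDataFrame(
    [("us", "nyc"), ("us", "sf"), ("us", "nyc"), ("in", "delhi")],
    ["country", "city"],
)

# Explicit grouping: one set of distinct cities per country
df.groupBy("country").agg(collect_set("city")).show()

# No grouping: the whole DataFrame is treated as a single group,
# so the result is one row containing one set of distinct cities
df.select(collect_set("city")).show()
```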

1. collect_set

df.select(collect_set(column)).first()...

This sends all the data in column column to a single node, which performs the collect_set operation (removing duplicates). If your data is large, it will swamp that single executor, since everything is funnelled to it.
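One way to see this, reusing the hypothetical df from the sketch above, is to print the physical plan; exact operator names vary by Spark version, but the final aggregation typically sits behind a single-partition exchange:

```python
# The plan usually shows a partial aggregation per partition, followed by
# an Exchange SinglePartition and a final aggregation in one task,
# i.e. every value of the column must fit on that one executor.
df.select(collect_set("city")).explain()
```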

2. distinct

df.select(column).distinct().collect()...

This partitions the data in column column by its value (the partition key); the number of partitions is the value of spark.sql.shuffle.partitions (say 200). So 200 tasks execute to remove duplicates, one per partition. Only the deduplicated data is then sent to the driver by the .collect() call. This can still fail if the data is huge even after removing duplicates, as it will cause the driver to run out of memory.
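As a small follow-on sketch with the same hypothetical df, the shuffle-partition count can be read (or tuned) through the SQL config, and only the deduplicated rows come back to the driver:

```python
# Number of partitions used for the shuffle behind distinct()
print(spark.conf.get("spark.sql.shuffle.partitions"))  # "200" by default

# Optionally tune it before running the query
spark.conf.set("spark.sql.shuffle.partitions", "50")

# distinct() deduplicates across the shuffle partitions in parallel;
# collect() then pulls only the already-deduplicated rows to the driver.
rows = df.select("city").distinct().collect()
print([r["city"] for r in rows])
```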

TLDR:

distinct() is better than collect_set for your specific need.

answered Oct 29 '25 by Samir Vyas


