If my goal is to collect distinct values in a column as a list, is there a performance difference or pros/cons using either of these?
df.select(column).distinct().collect()...
vs
df.select(collect_set(column)).first()...
collect_set is an aggregator function and requires a groupBy in the beginning. When there is no grouping provided it will take entire data as 1 big group.
df.select(collect_set(column)).first()...
This will send all data of column column to a single node which will perform collect_set operation (removing duplicates). If your data size is big then it will swamp the single executor where all data goes.
df.select(column).distinct().collect()...
This will partition all data of column column based on its value (called partition key), no. of partitions will be the value of spark.sql.shuffle.partitions (say 200). So 200 tasks will execute to remove duplicates, 1 for each partition key. Then only dedup data will be sent to the driver for .collect() operation. This will fail if your data after removing duplicates is huge, will cause driver to go out of memory.
.distinct is better than .collect_set for your specific need
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With