 

Should we use groupBy on a DataFrame or reduceByKey? [duplicate]

When grouping a DataFrame in Apache Spark with groupBy and then aggregating another column, is there a performance issue? Would reduceByKey be a better option?

df.groupBy("primaryKey").agg(max("another column"))
Nsp asked Sep 20 '25 17:09



1 Answer

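On an RDD, groupByKey shuffles every value for a key across the network before any aggregation happens, whereas reduceByKey first combines the values for each key within each partition and shuffles only those partial results. That is why reduceByKey generally performs better than groupByKey. On a DataFrame, however, groupBy followed by a built-in aggregate such as max is planned with a partial aggregation on each partition before the shuffle, so it behaves much like reduceByKey and does not carry the same penalty. You can run the aggregation with either API.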
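The gap between the two approaches comes down to how many records cross the shuffle. Here is a minimal plain-Python sketch of that idea. It is not Spark code: the partition contents and the function names `group_then_reduce` and `combine_then_reduce` are hypothetical, and each list simply stands in for one partition.

```python
from collections import defaultdict

# Hypothetical data: two "partitions" of (primaryKey, value) records.
partitions = [
    [("a", 3), ("b", 7), ("a", 9), ("b", 1)],
    [("a", 5), ("b", 2), ("a", 4), ("c", 8)],
]


def group_then_reduce(parts):
    """groupByKey style: shuffle every record, then take the max per key."""
    shuffled = [rec for part in parts for rec in part]  # all records cross the shuffle
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return {key: max(values) for key, values in grouped.items()}, len(shuffled)


def combine_then_reduce(parts):
    """reduceByKey style: max per key inside each partition, then shuffle the partials."""
    shuffled = []
    for part in parts:
        partial = {}
        for key, value in part:
            partial[key] = max(partial.get(key, value), value)
        shuffled.extend(partial.items())  # at most one record per key per partition
    result = {}
    for key, value in shuffled:
        result[key] = max(result.get(key, value), value)
    return result, len(shuffled)


grouped_result, grouped_shuffled = group_then_reduce(partitions)
combined_result, combined_shuffled = combine_then_reduce(partitions)
print(grouped_result == combined_result)    # both strategies give the same maxima
print(grouped_shuffled, combined_shuffled)  # records moved by each strategy
```

Both functions return the same per-key maxima, but the second one moves fewer records because each partition sends at most one record per key. The saving grows with the number of repeated keys per partition.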

Sagar balai answered Sep 22 '25 07:09
