I am trying to pivot a dataframe of raw data (6 GB in size), and it takes about 30 minutes with sum as the aggregation function:
from pyspark.sql.functions import sum as sum_
from pyspark.sql.types import DoubleType

x_pivot = (raw_df.groupBy("a", "b", "c", "d", "e", "f")
    .pivot("g")
    .agg(sum_(raw_df["h"].cast(DoubleType())).alias("h"),
         sum_(raw_df["i"]).alias("i")))
When I changed the aggregate function to first, it started taking 1.5 hours. Could you please help me understand why the aggregation function affects performance, and how I can improve it?
For the best performance, specify the distinct values of your pivot column (if you know them). Otherwise, a job will be immediately launched to determine them.
Pass them as a list, like this:
x_pivot = (raw_df.groupBy("a", "b", "c", "d", "e", "f")
    .pivot("g", ["V1", "V2", "V3"])
    .agg(sum_(raw_df["h"].cast(DoubleType())).alias("h"),
         sum_(raw_df["i"]).alias("i")))
V1, V2, and V3 are distinct values from the "g" column.
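If you don't know the distinct values in advance, you can collect them once yourself and pass them in explicitly, so the distinct-values scan runs exactly once and under your control. A minimal sketch, assuming raw_df and the column names from the question:

from pyspark.sql.functions import sum as sum_
from pyspark.sql.types import DoubleType

# Collect the distinct pivot values once, then pass them to pivot() explicitly
g_values = [row["g"] for row in raw_df.select("g").distinct().collect()]

x_pivot = (raw_df.groupBy("a", "b", "c", "d", "e", "f")
    .pivot("g", g_values)
    .agg(sum_(raw_df["h"].cast(DoubleType())).alias("h"),
         sum_(raw_df["i"]).alias("i")))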
The above answer will improve performance slightly, but if I'm correct, using a list gives you O(n^2) time complexity. For better efficiency, store the distinct values that will become your column names in a list, then use a for loop to filter and pivot on one value at a time (i.e., one list element per iteration); see the sketch below. This should give O(n) time complexity, I believe. I was able to turn a job that took a few hours into one that now runs in 2 minutes on Databricks.
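Here is a minimal sketch of that loop, assuming raw_df and the column names from the question. The outer join on the grouping keys that reassembles the per-value pivots is my own reconstruction of the approach, not code from the original job:

from functools import reduce
import pyspark.sql.functions as F

keys = ["a", "b", "c", "d", "e", "f"]
g_values = [row["g"] for row in raw_df.select("g").distinct().collect()]

pieces = []
for v in g_values:
    # Pivot a single known value; the explicit one-element list
    # avoids any per-iteration distinct-values job
    piece = (raw_df.filter(F.col("g") == v)
        .groupBy(*keys)
        .pivot("g", [v])
        .agg(F.sum(F.col("h").cast("double")).alias("h"),
             F.sum("i").alias("i")))
    pieces.append(piece)

# Join the single-value pivots back together on the grouping keys
x_pivot = reduce(lambda left, right: left.join(right, keys, "outer"), pieces)

Each iteration only touches the rows for one pivot value, and the per-value results are stitched back together at the end.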