
Spark pivot groupby performance very slow

I am pivoting a DataFrame of about 6 GB of raw data, and with sum as the aggregation function it takes around 30 minutes:

    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    x_pivot = (raw_df.groupBy("a", "b", "c", "d", "e", "f")
                     .pivot("g")
                     .agg(F.sum(raw_df["h"].cast(DoubleType())).alias("h"),
                          F.sum(raw_df["i"]).alias("i")))

When I changed the aggregation function to first, it started taking 1.5 hours. Can you help me understand why the choice of aggregation function affects performance this much, and how I can improve it?

asked May 11 '18 by Geeta Singh

2 Answers

For the best performance, specify the distinct values of your pivot column (if you know them). Otherwise, a job will be immediately launched to determine them.

Like this, passing them in as a list:

    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    x_pivot = (raw_df.groupBy("a", "b", "c", "d", "e", "f")
                     .pivot("g", ["V1", "V2", "V3"])
                     .agg(F.sum(raw_df["h"].cast(DoubleType())).alias("h"),
                          F.sum(raw_df["i"]).alias("i")))

V1, V2, and V3 are the distinct values of the "g" column.
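
If you don't know the distinct values up front, a common pattern is to collect them once yourself and pass them in (a minimal sketch; the g_values name is mine):

    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # collect the distinct pivot values once, up front
    g_values = [row["g"] for row in raw_df.select("g").distinct().collect()]

    x_pivot = (raw_df.groupBy("a", "b", "c", "d", "e", "f")
                     .pivot("g", g_values)
                     .agg(F.sum(raw_df["h"].cast(DoubleType())).alias("h"),
                          F.sum(raw_df["i"]).alias("i")))

This still runs one small extra job for the distinct, but it keeps the pivot values explicit and avoids Spark inferring them inside the pivot itself.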

answered Nov 15 '22 by Mayur Dangar


The above answer will improve performance slightly, but if I'm correct, using a list still gives you O(n^2) time complexity. For better efficiency, store the distinct would-be column names in a list, then use a for loop to filter and pivot on one value at a time (one list element at a time), as in the sketch below. This should give you O(n) time complexity, I believe. I was able to turn a job that used to take a few hours into one that now runs in 2 minutes in Databricks.
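
Here is a minimal sketch of how I read that approach in PySpark (g_values, parts, and keys are names I'm introducing; adjust the aggregations to your own columns): filter on one pivot value per iteration, aggregate, and join the per-value results back on the grouping keys:

    from functools import reduce
    from pyspark.sql import functions as F

    keys = ["a", "b", "c", "d", "e", "f"]
    g_values = [row["g"] for row in raw_df.select("g").distinct().collect()]

    # build one narrow aggregate per pivot value instead of one wide pivot
    parts = []
    for v in g_values:
        parts.append(
            raw_df.filter(F.col("g") == v)
                  .groupBy(*keys)
                  .agg(F.sum(F.col("h").cast("double")).alias(f"{v}_h"),
                       F.sum("i").alias(f"{v}_i")))

    # stitch the per-value results back together on the grouping keys
    x_pivot = reduce(lambda left, right: left.join(right, keys, "outer"), parts)

Whether this beats a plain pivot depends on your data; each iteration rescans raw_df unless it is cached, so calling raw_df.cache() before the loop is worth trying.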

answered Nov 15 '22 by Everton Robinson