I am trying to pivot a dataframe of raw data (6 GB in size), and it takes about 30 minutes with sum as the aggregation function:
from pyspark.sql.functions import sum as sum_
from pyspark.sql.types import DoubleType

x_pivot = (raw_df.groupBy("a", "b", "c", "d", "e", "f")
    .pivot("g")
    .agg(sum_(raw_df["h"].cast(DoubleType())).alias("h"),
         sum_(raw_df["i"]).alias("i")))
When I changed the aggregate function to first, it started taking 1.5 hours. Could you please help me understand why the aggregation function affects performance, and how I can improve it?
For the best performance, specify the distinct values of your pivot column (if you know them). Otherwise, a job will be immediately launched to determine them.
Pass them as a list, like this:
x_pivot = (raw_df.groupBy("a", "b", "c", "d", "e", "f")
    .pivot("g", ["V1", "V2", "V3"])
    .agg(sum_(raw_df["h"].cast(DoubleType())).alias("h"),
         sum_(raw_df["i"]).alias("i")))
V1, V2, and V3 are distinct values from the "g" column.
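If you don't know the distinct values in advance, you can collect them once yourself and pass them in explicitly, so the distinct-values scan runs exactly once and under your control. A minimal sketch, assuming raw_df and the column names from the question:

from pyspark.sql.functions import sum as sum_
from pyspark.sql.types import DoubleType

# Collect the distinct pivot values once, then pass them to pivot() explicitly
g_values = [row["g"] for row in raw_df.select("g").distinct().collect()]

x_pivot = (raw_df.groupBy("a", "b", "c", "d", "e", "f")
    .pivot("g", g_values)
    .agg(sum_(raw_df["h"].cast(DoubleType())).alias("h"),
         sum_(raw_df["i"]).alias("i")))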
The above answer will improve performance slightly, but if I'm correct, using a list gives you O(n^2) time complexity. For better efficiency, store the distinct values that will become your column names in a list, then use a for loop to filter and pivot on one value at a time (i.e., one list element per iteration); see the sketch below. This should give O(n) time complexity, I believe. I was able to turn a job that took a few hours into one that now runs in 2 minutes on Databricks.
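Here is a minimal sketch of that loop, assuming raw_df and the column names from the question. The outer join on the grouping keys that reassembles the per-value pivots is my own reconstruction of the approach, not code from the original job:

from functools import reduce
import pyspark.sql.functions as F

keys = ["a", "b", "c", "d", "e", "f"]
g_values = [row["g"] for row in raw_df.select("g").distinct().collect()]

pieces = []
for v in g_values:
    # Pivot a single known value; the explicit one-element list
    # avoids any per-iteration distinct-values job
    piece = (raw_df.filter(F.col("g") == v)
        .groupBy(*keys)
        .pivot("g", [v])
        .agg(F.sum(F.col("h").cast("double")).alias("h"),
             F.sum("i").alias("i")))
    pieces.append(piece)

# Join the single-value pivots back together on the grouping keys
x_pivot = reduce(lambda left, right: left.join(right, keys, "outer"), pieces)

Each iteration only touches the rows for one pivot value, and the per-value results are stitched back together at the end.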