With a dataframe as follows:
from pyspark.sql.functions import avg, first
rdd = sc.parallelize(
    [
        (0, "A", 223, "201603", "PORT"),
        (0, "A", 22, "201602", "PORT"),
        (0, "A", 422, "201601", "DOCK"),
        (1, "B", 3213, "201602", "DOCK"),
        (1, "B", 3213, "201601", "PORT"),
        (2, "C", 2321, "201601", "DOCK")
    ]
)
df_data = sqlContext.createDataFrame(rdd, ["id", "type", "cost", "date", "ship"])
df_data.show()
I do a pivot on it:
df_data.groupby(df_data.id, df_data.type).pivot("date").agg(avg("cost"), first("ship")).show()
+---+----+----------------+--------------------+----------------+--------------------+----------------+--------------------+
| id|type|201601_avg(cost)|201601_first(ship)()|201602_avg(cost)|201602_first(ship)()|201603_avg(cost)|201603_first(ship)()|
+---+----+----------------+--------------------+----------------+--------------------+----------------+--------------------+
| 2| C| 2321.0| DOCK| null| null| null| null|
| 0| A| 422.0| DOCK| 22.0| PORT| 223.0| PORT|
| 1| B| 3213.0| PORT| 3213.0| DOCK| null| null|
+---+----+----------------+--------------------+----------------+--------------------+----------------+--------------------+
But I get these really complicated names for the columns. Applying an alias
on the aggregation usually works, but in this case, because of the pivot,
the names are even worse:
+---+----+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+
| id|type|201601_(avg(cost),mode=Complete,isDistinct=false) AS cost#1619|201601_(first(ship)(),mode=Complete,isDistinct=false) AS ship#1620|201602_(avg(cost),mode=Complete,isDistinct=false) AS cost#1619|201602_(first(ship)(),mode=Complete,isDistinct=false) AS ship#1620|201603_(avg(cost),mode=Complete,isDistinct=false) AS cost#1619|201603_(first(ship)(),mode=Complete,isDistinct=false) AS ship#1620|
+---+----+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+
| 2| C| 2321.0| DOCK| null| null| null| null|
| 0| A| 422.0| DOCK| 22.0| PORT| 223.0| PORT|
| 1| B| 3213.0| PORT| 3213.0| DOCK| null| null|
+---+----+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+
Is there a way to rename the column names on the fly on the pivot and aggregation?
Spark has a withColumnRenamed() function on DataFrame to change a column name. This is the most straightforward approach: it takes two parameters, the existing column name and the new column name you want, and returns a new DataFrame (Dataset[Row]) with the column renamed.
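A minimal sketch of that approach, assuming the pivoted DataFrame from the question is bound to a variable named pivoted (the variable name is mine; the column names are copied from the output above):

pivoted = df_data.groupby(df_data.id, df_data.type).pivot("date").agg(avg("cost"), first("ship"))
# Rename the generated columns one at a time.
renamed = pivoted \
    .withColumnRenamed("201601_avg(cost)", "201601_cost") \
    .withColumnRenamed("201601_first(ship)()", "201601_ship")

The drawback is that you need one call per generated column, so with many pivot values a programmatic rename, like the regex answer below, scales better.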
A simple regular expression should do the trick:
import re

def clean_names(df):
    # Match generated names like "201601_avg(cost)" or "201601_first(ship)()"
    # and rewrite them to "201601_cost" / "201601_ship".
    p = re.compile(r"^(\w+?)_([a-z]+)\((\w+)\)(?:\(\))?")
    return df.toDF(*[p.sub(r"\1_\3", c) for c in df.columns])
pivoted = df_data.groupby(...).pivot(...).agg(...)
clean_names(pivoted).printSchema()
## root
## |-- id: long (nullable = true)
## |-- type: string (nullable = true)
## |-- 201601_cost: double (nullable = true)
## |-- 201601_ship: string (nullable = true)
## |-- 201602_cost: double (nullable = true)
## |-- 201602_ship: string (nullable = true)
## |-- 201603_cost: double (nullable = true)
## |-- 201603_ship: string (nullable = true)
If you want to preserve the function name, you can change the substitution pattern to, for example, \1_\2_\3.
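A sketch of that variant (the helper name clean_names_keep_fn is mine; the output follows from applying the pattern to the column names shown above):

def clean_names_keep_fn(df):
    # Keep the aggregate function name:
    # "201601_avg(cost)" -> "201601_avg_cost"
    p = re.compile(r"^(\w+?)_([a-z]+)\((\w+)\)(?:\(\))?")
    return df.toDF(*[p.sub(r"\1_\2_\3", c) for c in df.columns])

clean_names_keep_fn(pivoted).columns
## ['id', 'type', '201601_avg_cost', '201601_first_ship', '201602_avg_cost', '201602_first_ship', '201603_avg_cost', '201603_first_ship']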
A simple approach is to use alias after the aggregate function. I start with the df_data Spark DataFrame you created.
df_data.groupby(df_data.id, df_data.type).pivot("date") \
    .agg(avg("cost").alias("avg_cost"), first("ship").alias("first_ship")) \
    .show()
+---+----+---------------+-----------------+---------------+-----------------+---------------+-----------------+
| id|type|201601_avg_cost|201601_first_ship|201602_avg_cost|201602_first_ship|201603_avg_cost|201603_first_ship|
+---+----+---------------+-----------------+---------------+-----------------+---------------+-----------------+
| 1| B| 3213.0| PORT| 3213.0| DOCK| null| null|
| 2| C| 2321.0| DOCK| null| null| null| null|
| 0| A| 422.0| DOCK| 22.0| PORT| 223.0| PORT|
+---+----+---------------+-----------------+---------------+-----------------+---------------+-----------------+
The column names will be of the form "pivot_value_aliased_name". In your case the pivot value is 201601 and the aliased name is avg_cost, so the resulting column name is 201601_avg_cost (joined by an underscore "_").
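To see the convention directly, you can list the generated column names (they match the header of the table above):

df_data.groupby(df_data.id, df_data.type).pivot("date") \
    .agg(avg("cost").alias("avg_cost"), first("ship").alias("first_ship")) \
    .columns
## ['id', 'type', '201601_avg_cost', '201601_first_ship', '201602_avg_cost', '201602_first_ship', '201603_avg_cost', '201603_first_ship']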
You can alias the aggregations directly:
pivoted = df_data \
.groupby(df_data.id, df_data.type) \
.pivot("date") \
.agg(
avg('cost').alias('cost'),
first("ship").alias('ship')
)
pivoted.printSchema()
## root
## |-- id: long (nullable = true)
## |-- type: string (nullable = true)
## |-- 201601_cost: double (nullable = true)
## |-- 201601_ship: string (nullable = true)
## |-- 201602_cost: double (nullable = true)
## |-- 201602_ship: string (nullable = true)
## |-- 201603_cost: double (nullable = true)
## |-- 201603_ship: string (nullable = true)