Pivot on two columns with both numeric and categorical value in pySpark

Question

I have a data set in pyspark like this : from collections import namedtuple

    user_row = namedtuple('user_row', 'id time category value'.split())
    data = [
        user_row(1,1,'speed','50'),
        user_row(1,1,'speed','60'),
        user_row(1,2,'door', 'open'),
        user_row(1,2,'door','open'),
        user_row(1,2,'door','close'),
        user_row(1,2,'speed','75'),
        user_row(2,10,'speed','30'), 
        user_row(2,11,'door', 'open'),
        user_row(2,12,'door','open'),
        user_row(2,13,'speed','50'),
        user_row(2,13,'speed','40')
    ]
    
    user_df = spark.createDataFrame(data)
    user_df.show()
+---+----+--------+-----+
| id|time|category|value|
+---+----+--------+-----+
|  1|   1|   speed|   50|
|  1|   1|   speed|   60|
|  1|   2|    door| open|
|  1|   2|    door| open|
|  1|   2|    door|close|
|  1|   2|   speed|   75|
|  2|  10|   speed|   30|
|  2|  11|    door| open|
|  2|  12|    door| open|
|  2|  13|   speed|   50|
|  2|  13|   speed|   40|
+---+----+--------+-----+

What I want to get is something like below where grouping by id and time and pivot on category and if it is numeric return the average and if it is categorical it returns the mode.

+---+----+--------+-----+
| id|time|    door|speed|
+---+----+--------+-----+
|  1|   1|    null|   55|
|  1|   2|    open|   75|
|  2|  10|    null|   30|
|  2|  11|    open| null|
|  2|  12|    open| null|
|  2|  13|    null|   45|
+---+----+--------+-----+

I tried this but for categorical value it returns null (I am not worry about nulls in speed column)

    df = user_df\
    .groupBy('id','time')\
    .pivot('category')\
    .agg(avg('value'))\
    .orderBy(['id', 'time'])\
    
    df.show()

+---+----+----+-----+
| id|time|door|speed|
+---+----+----+-----+
|  1|   1|null| 55.0|
|  1|   2|null| 75.0|
|  2|  10|null| 30.0|
|  2|  11|null| null|
|  2|  12|null| null|
|  2|  13|null| 45.0|
+---+----+----+-----+

Raghu · Accepted Answer

You can do an additional pivot and coalesce them. try this.

import pyspark.sql.functions as F
from collections import namedtuple

user_row = namedtuple('user_row', 'id time category value'.split())
data = [
    user_row(1,1,'speed','50'),
    user_row(1,1,'speed','60'),
    user_row(1,2,'door', 'open'),
    user_row(1,2,'door','open'),
    user_row(1,2,'door','close'),
    user_row(1,2,'speed','75'),
    user_row(2,10,'speed','30'), 
    user_row(2,11,'door', 'open'),
    user_row(2,12,'door','open'),
    user_row(2,13,'speed','50'),
    user_row(2,13,'speed','40')
]

user_df = spark.createDataFrame(data)
#%%
#user_df.show()
df = user_df.groupBy('id','time')\
            .pivot('category')\
            .agg(F.avg('value').alias('avg'),F.max('value').alias('max'))\
#%%
expr1= [x for x in df.columns if '_avg' in x]
expr2= [x for x in df.columns if 'max' in x]
expr=zip(expr1,expr2)
#%%
sel_expr= [F.coalesce(x[0],x[1]).alias(x[0].split('_')[0]) for x in expr]
#%%
    
df_final = df.select('id','time',*sel_expr).orderBy('id','time')

df_final.show()
+---+----+----+-----+
| id|time|door|speed|
+---+----+----+-----+
|  1|   1|null| 55.0|
|  1|   2|open| 75.0|
|  2|  10|null| 30.0|
|  2|  11|open| null|
|  2|  12|open| null|
|  2|  13|null| 45.0|
+---+----+----+-----+

Pivot on two columns with both numeric and categorical value in pySpark

Tags:

jupyter-notebook

pivot

apache-spark

apache-spark-sql

pyspark

Maddie

1 Answers

Raghu

Recent Activity

Donate For Us

Pivot on two columns with both numeric and categorical value in pySpark

Tags:

jupyter-notebook

pivot

apache-spark

apache-spark-sql

pyspark

Maddie

1 Answers

Raghu

Related questions

Recent Activity

Donate For Us