Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Transpose a dataframe in Pyspark

how can I do to transpose the following data frame in Pyspark?

The idea is to achieve the result that appears below.

import pandas as pd

d = {'id' : pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
     'place' : pd.Series(['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
     'value' : pd.Series([10, 30, 20, 10, 30, 20, 10, 30, 20], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
     'attribute' : pd.Series(['size', 'height', 'weigth', 'size', 'height', 'weigth','size', 'height', 'weigth'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])}

   id place  value attribute
a   1     A     10      size
b   1     A     30    height
c   1     A     20    weigth
d   2     A     10      size
e   2     A     30    height
f   2     A     20    weigth
g   3     A     10      size
h   3     A     30    height
i   3     A     20    weigth

d = {'id' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'place' : pd.Series(['A', 'A', 'A'], index=['a', 'b', 'c']),
     'size' : pd.Series([10, 30, 20], index=['a', 'b', 'c']),
     'height' : pd.Series([10, 30, 20], index=['a', 'b', 'c']),
     'weigth' : pd.Series([10, 30, 20], index=['a', 'b', 'c'])}

df = pd.DataFrame(d)
print(df)

   id place  size  height  weigth
a   1     A    10      10      10
b   2     A    30      30      30
c   3     A    20      20      20

Any help is welcome. From already thank you very much

like image 428
lolo Avatar asked Nov 23 '18 21:11

lolo


2 Answers

First of all I don't think your sample output is correct. Your input data has size set to 10, height set to 30 and weigth set to 20 for every id, but the desired output set's everything to 10 for id 1. If this is really what you, please explain it a bit more. If this was a mistake, then you want to use the pivot function. Example:

from pyspark.sql.functions import first
l =[( 1        ,'A', 10, 'size' ),
( 1        , 'A', 30, 'height' ),
( 1        , 'A', 20, 'weigth' ),
( 2        , 'A', 10, 'size' ),
( 2        , 'A', 30, 'height' ),
( 2        , 'A', 20, 'weigth' ),
( 3        , 'A', 10, 'size' ),
( 3        , 'A', 30, 'height' ),
( 3        , 'A', 20, 'weigth' )]

df = spark.createDataFrame(l, ['id','place', 'value', 'attribute'])

df.groupBy(df.id, df.place).pivot('attribute').agg(first("value")).show()

+---+-----+------+----+------+ 
| id|place|height|size|weigth|
+---+-----+------+----+------+ 
|  2|    A|    30|  10|    20| 
|  3|    A|    30|  10|    20| 
|  1|    A|    30|  10|    20|
+---+-----+------+----+------+
like image 142
cronoik Avatar answered Oct 19 '22 11:10

cronoik


Refer to the documentation. Pivoting is always done in context to aggregation, and I have chosen sum here. So, if for same id, place or attribute, there are multiple values, then their sum will be taken. You could use min,max or mean as well, depending upon what you need.

df = df.groupBy(["id","place"]).pivot("attribute").sum("value")

This link also addresses the same question.

like image 35
cph_sto Avatar answered Oct 19 '22 11:10

cph_sto