 

aggregate Dataframe pyspark

I'm using Spark 1.6.2 with DataFrames.

I want to convert this DataFrame:

+---------+-------------+-----+-------+-------+-------+-------+--------+
|ID       |           P |index|xinf   |xup    |yinf   |ysup   |     M  |
+---------+-------------+-----+-------+-------+-------+-------+--------+
|        0|10279.9003906|   13|    0.3|    0.5|    2.5|    3.0|540928.0|
|        2|12024.2998047|   13|    0.3|    0.5|    2.5|    3.0|541278.0|
|        0|10748.7001953|   13|    0.3|    0.5|    2.5|    3.0|541243.0|
|        1|      10988.5|   13|    0.3|    0.5|    2.5|    3.0|540917.0|
+---------+-------------+-----+-------+-------+-------+-------+--------+

to:

+---------+-------------+-----+-------+-------+-------+-------+--------+
|Id       |           P |index|xinf   |xup    |yinf   |ysup   |     M  |
+---------+-------------+-----+-------+-------+-------+-------+--------+
|        0|10514.3002929|   13|    0.3|    0.5|    2.5|    3.0|540928.0,541243.0|
|        2|12024.2998047|   13|    0.3|    0.5|    2.5|    3.0|541278.0|
|        1|      10988.5|   13|    0.3|    0.5|    2.5|    3.0|540917.0|
+---------+-------------+-----+-------+-------+-------+-------+--------+

So, I want to group by ID, compute the mean of the P values, and concatenate the M values. But I couldn't do that using Spark's agg function.

Can you help me, please?

asked Oct 20 '16 by MrGildarts

1 Answer

You can groupBy the ID column and then aggregate each remaining column as needed; mean and collect_list will do what you describe:

from pyspark.sql.functions import first, collect_list, mean

df.groupBy("ID").agg(mean("P"), first("index"), 
                     first("xinf"), first("xup"), 
                     first("yinf"), first("ysup"), 
                     collect_list("M"))
answered Nov 15 '22 by Alberto Bonsanto