I'm using Spark 1.6.2 with DataFrames,
and I want to convert this DataFrame
+---+-------------+-----+----+---+----+----+--------+
| ID|            P|index|xinf|xup|yinf|ysup|       M|
+---+-------------+-----+----+---+----+----+--------+
|  0|10279.9003906|   13| 0.3|0.5| 2.5| 3.0|540928.0|
|  2|12024.2998047|   13| 0.3|0.5| 2.5| 3.0|541278.0|
|  0|10748.7001953|   13| 0.3|0.5| 2.5| 3.0|541243.0|
|  1|      10988.5|   13| 0.3|0.5| 2.5| 3.0|540917.0|
+---+-------------+-----+----+---+----+----+--------+
to
+---+-------------+-----+----+---+----+----+-----------------+
| ID|            P|index|xinf|xup|yinf|ysup|                M|
+---+-------------+-----+----+---+----+----+-----------------+
|  0|10514.3002929|   13| 0.3|0.5| 2.5| 3.0|540928.0,541243.0|
|  2|12024.2998047|   13| 0.3|0.5| 2.5| 3.0|         541278.0|
|  1|      10988.5|   13| 0.3|0.5| 2.5| 3.0|         540917.0|
+---+-------------+-----+----+---+----+----+-----------------+
So I want to group by ID, compute the mean of P within each group, and concatenate the M values. I couldn't do that using Spark's agg function.
Can you help me, please?
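You can groupBy the column ID and then aggregate each column depending on what you need: mean for P, first for the columns that are constant within a group, and collect_list for M. Note that collect_list returns an array rather than a comma-separated string; if you need a string, casting M to string and wrapping the collected array in concat_ws(",", ...) should work. On Spark 1.6, collect_list is, as far as I know, only available through a HiveContext.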
from pyspark.sql.functions import first, collect_list, mean

# Average P, keep the first value of the per-ID constant columns, collect M into an array.
df.groupBy("ID").agg(mean("P"), first("index"),
                     first("xinf"), first("xup"),
                     first("yinf"), first("ysup"),
                     collect_list("M"))
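To sanity-check the numbers in the expected output independently of Spark, the same grouping can be sketched in plain Python on the four sample rows from the question. This is only an illustration of the aggregation semantics, not Spark code; the helper name aggregate_by_id and the tuple layout are assumptions made for this sketch.

```python
# Sample rows from the question: (ID, P, index, xinf, xup, yinf, ysup, M)
rows = [
    (0, 10279.9003906, 13, 0.3, 0.5, 2.5, 3.0, 540928.0),
    (2, 12024.2998047, 13, 0.3, 0.5, 2.5, 3.0, 541278.0),
    (0, 10748.7001953, 13, 0.3, 0.5, 2.5, 3.0, 541243.0),
    (1, 10988.5, 13, 0.3, 0.5, 2.5, 3.0, 540917.0),
]


def aggregate_by_id(rows):
    """Hypothetical helper: mean of P, first value of the middle columns, M joined with commas."""
    groups = {}
    for row in rows:
        # Dicts keep insertion order (Python 3.7+), so IDs appear in first-seen order.
        groups.setdefault(row[0], []).append(row)

    result = []
    for key, members in groups.items():
        mean_p = sum(r[1] for r in members) / len(members)
        first_cols = members[0][2:7]  # index, xinf, xup, yinf, ysup
        joined_m = ",".join(str(r[7]) for r in members)
        result.append((key, mean_p) + first_cols + (joined_m,))
    return result


for line in aggregate_by_id(rows):
    print(line)
```

For ID 0 this gives a mean P of about 10514.3003 and the M string 540928.0,541243.0, which matches the desired table.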