Im using Spark 1.6.2 with dataframe
And i want to convert this dataframe
+---------+-------------+-----+-------+-------+-------+-------+--------+
|ID | P |index|xinf |xup |yinf |ysup | M |
+---------+-------------+-----+-------+-------+-------+-------+--------+
| 0|10279.9003906| 13| 0.3| 0.5| 2.5| 3.0|540928.0|
| 2|12024.2998047| 13| 0.3| 0.5| 2.5| 3.0|541278.0|
| 0|10748.7001953| 13| 0.3| 0.5| 2.5| 3.0|541243.0|
| 1| 10988.5| 13| 0.3| 0.5| 2.5| 3.0|540917.0|
+---------+-------------+-----+-------+-------+-------+-------+--------+
to
+---------+-------------+-----+-------+-------+-------+-------+--------+
|Id | P |index|xinf |xup |yinf |ysup | M |
+---------+-------------+-----+-------+-------+-------+-------+--------+
| 0|10514.3002929| 13| 0.3| 0.5| 2.5| 3.0|540928.0,541243.0|
| 2|12024.2998047| 13| 0.3| 0.5| 2.5| 3.0|541278.0|
| 1| 10988.5| 13| 0.3| 0.5| 2.5| 3.0|540917.0|
+---------+-------------+-----+-------+-------+-------+-------+--------+
So, I want to reduce by Id, and calculate mean of P rows and concatenate M rows. But I coudn't do that using function agg of spark.
can you help me please
You can groupBy
the column ID
and then aggregate each column depending on what you need, mean
and concat
will help you.
from pyspark.sql.functions import first, collect_list, mean
df.groupBy("ID").agg(mean("P"), first("index"),
first("xinf"), first("xup"),
first("yinf"), first("ysup"),
collect_list("M"))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With