I'm using PySpark. I have a Spark dataframe that looks like this:
a | b | c
5 | 2 | 1
5 | 4 | 3
2 | 4 | 2
2 | 3 | 7
Desired output:
a | b_list
5 | 2,1,4,3
2 | 4,2,3,7
It's important that the output preserves the row sequence shown.
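For reference, a minimal sketch to reproduce this sample dataframe (assuming a SparkSession is already available as spark):

# Hypothetical setup, assuming an existing SparkSession named `spark`
df = spark.createDataFrame(
    [(5, 2, 1), (5, 4, 3), (2, 4, 2), (2, 3, 7)],
    ['a', 'b', 'c'])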
Instead of a UDF for joining the list, we can also use the concat_ws function, as suggested in the comments above:
import pyspark.sql.functions as F

df = (df
      .withColumn('lst', F.concat(df['b'], F.lit(','), df['c']))
      .groupBy('a')
      .agg(F.concat_ws(',', F.collect_list('lst')).alias('lst')))
df.show()
+---+-------+
| a| lst|
+---+-------+
| 5|2,1,4,3|
| 2|4,2,3,7|
+---+-------+
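One caveat: collect_list does not formally guarantee row order after a shuffle, even though it often comes out as shown. A hedged sketch that makes the order explicit by tagging rows with monotonically_increasing_id before grouping (the idx and pairs names here are illustrative; the id is increasing within each partition in read order, which is usually what you want for a freshly loaded dataframe):

import pyspark.sql.functions as F

# Tag each row with an increasing id, collect (id, value) structs,
# sort by id inside each group, then join the values.
df = (df
      .withColumn('idx', F.monotonically_increasing_id())
      .withColumn('lst', F.concat(df['b'], F.lit(','), df['c']))
      .groupBy('a')
      .agg(F.sort_array(F.collect_list(F.struct('idx', 'lst'))).alias('pairs'))
      .select('a', F.concat_ws(',', F.col('pairs.lst')).alias('lst')))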
The following aggregates the last two columns into an array column:
import pyspark.sql.functions as f

df1 = (df
       .withColumn('lst', f.concat(df['b'], f.lit(','), df['c']))
       .groupBy('a')
       .agg(f.collect_list('lst').alias('b_list')))
Now join the array elements:

# Simplistic UDF to join the array:
def join_array(col):
    return ','.join(col)

join = f.udf(join_array)

df1.select('a', join(df1['b_list']).alias('b_list')).show()
Printing:
+---+-------+
| a| b_list|
+---+-------+
| 5|2,1,4,3|
| 2|4,2,3,7|
+---+-------+
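If you're on Spark 2.4 or later, the UDF can be avoided entirely with the built-in array_join function, which gives the same result without the Python round trip:

import pyspark.sql.functions as f

# array_join (Spark 2.4+) joins array elements with a delimiter,
# replacing the Python UDF above
df1.select('a', f.array_join(df1['b_list'], ',').alias('b_list')).show()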