
groupby and convert multiple columns into a list using pyspark

I'm using PySpark and have a Spark DataFrame that looks like this:

a | b | c
5 | 2 | 1
5 | 4 | 3
2 | 4 | 2
2 | 3 | 7

Desired output:

a | b_list
5 | 2,1,4,3
2 | 4,2,3,7

It's important to keep the values in the order shown in the output.
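To pin down the expected result independently of Spark, here is a minimal plain-Python sketch of the same transformation. The names `rows` and `group_pairs` are made up for illustration, and the data is just the sample table above.

```python
# Plain-Python reference for the desired result (no Spark needed).
# `rows` and `group_pairs` are illustrative names, not part of any API.
rows = [(5, 2, 1), (5, 4, 3), (2, 4, 2), (2, 3, 7)]

def group_pairs(rows):
    """Group by the first field, appending b then c in input order."""
    grouped = {}  # dicts preserve insertion order (Python 3.7+)
    for a, b, c in rows:
        grouped.setdefault(a, []).extend([str(b), str(c)])
    return {a: ",".join(vals) for a, vals in grouped.items()}

result = group_pairs(rows)
print(result)
```

For each value of a, the b and c values of every row are appended in row order, which is the sequence the output above requires.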

asked Oct 17 '22 19:10 by YOLO


2 Answers

Instead of a UDF, you can join the list with the built-in concat_ws function, as suggested in the comments above:

import pyspark.sql.functions as F

df = (df
      # Build a "b,c" string for each row
      .withColumn('lst', F.concat(df['b'], F.lit(','), df['c']))
      .groupBy('a')
      # Collect the per-row strings for each group and join them with commas
      .agg(F.concat_ws(',', F.collect_list('lst')).alias('lst')))

df.show()

+---+-------+
|  a|    lst|
+---+-------+
|  5|2,1,4,3|
|  2|4,2,3,7|
+---+-------+
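One caveat on ordering: Spark documents collect_list as non-deterministic, because the order of the collected values depends on row order, which may change after a shuffle. If the sequence really matters, one common approach is to carry an explicit row index alongside each value and sort on it before joining. The sketch below shows only that sort-then-join step in plain Python; the `collected` data and the `join_in_order` helper are hypothetical, and how the index gets attached in Spark is left open here.

```python
# Pairs of (row_index, "b,c" string) as they might arrive after a shuffle,
# i.e. not necessarily in the original row order. Hypothetical data.
collected = [(1, "4,3"), (0, "2,1")]

def join_in_order(pairs, sep=","):
    """Sort by the explicit row index, then join the values."""
    return sep.join(value for _, value in sorted(pairs))

ordered = join_in_order(collected)
print(ordered)
```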
answered Dec 15 '22 05:12 by YOLO


The following aggregates the last two columns into an array column:

import pyspark.sql.functions as f

# Build a "b,c" string per row, then collect the strings for each value of a
df1 = df.withColumn('lst', f.concat(df['b'], f.lit(','), df['c']))\
  .groupBy('a')\
  .agg(f.collect_list('lst').alias('b_list'))

Now join the array elements:

# Simplistic UDF to join the array elements:
def join_array(col):
    return ','.join(col)

join = f.udf(join_array)

df1.select('a', join(df1['b_list']).alias('b_list'))\
  .show()

This prints:

+---+-------+
|  a| b_list|
+---+-------+
|  5|2,1,4,3|
|  2|4,2,3,7|
+---+-------+
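The body of the UDF is ordinary Python, so it can be tried without a Spark session. Note that ','.join only accepts strings, so it would raise a TypeError if the array held integers (for example, if b and c were collected without being concatenated into strings first). A slightly more defensive sketch, which assumes the array may be missing or may hold non-string values (neither happens with the sample data):

```python
def join_array(values, sep=","):
    """Join array elements into one string, converting each to str first."""
    if values is None:  # a null array column reaches a Python UDF as None
        return None
    return sep.join(str(v) for v in values)

from_strings = join_array(["2,1", "4,3"])  # elements as built in the answer
from_ints = join_array([2, 1, 4, 3])       # would fail with a bare ','.join
print(from_strings, from_ints)
```

It can be wrapped with f.udf in the same way as the original function.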
answered Dec 15 '22 03:12 by ernest_k