
How to convert an array to string efficiently in PySpark / Python

Tags:

python

pyspark

I have a df with the following schema:

root
 |-- col1: string (nullable = true)
 |-- col2: array (nullable = true)
 |    |-- element: string (containsNull = true)

in which one of the columns, col2, is an array such as [1#b, 2#b, 3#c]. I want to convert this to the string 1#b,2#b,3#c.
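
For reproducibility, a minimal DataFrame matching this schema can be built like this (the row values are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame(
    [("a", ["1#b", "2#b", "3#c"])],
    "col1 string, col2 array<string>",  # DDL-style schema string
)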

I am currently doing this with the following snippet:

from pyspark.sql.functions import explode, concat_ws, collect_list

# explode() leaves the array elements in a column named 'col' by default
df2 = df1.select("*", explode("col2")).drop("col2")
df2 = df2.groupBy("col1").agg(concat_ws(",", collect_list("col")).alias("col2"))

While this gets the job done, it is slow: explode followed by groupBy and collect_list shuffles the whole DataFrame just to rebuild rows that already existed.

Is there a better alternative?



1 Answer

You can call concat_ws directly on the array column; it joins the array's elements with the separator, so no explode/groupBy round trip is needed:

from pyspark.sql.functions import concat_ws

df1.withColumn('col2', concat_ws(',', 'col2'))
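
Applied to the illustrative DataFrame above, this produces the expected string (a quick sanity check):

df1.withColumn('col2', concat_ws(',', 'col2')).show(truncate=False)
# +----+-----------+
# |col1|col2       |
# +----+-----------+
# |a   |1#b,2#b,3#c|
# +----+-----------+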