I have this PySpark DataFrame:
+-----------+--------------------+
|       uuid|            test_123|
+-----------+--------------------+
|          1|[test, test2, test3]|
|          2|[test4, test, test6]|
|          3|[test6, test9, t55o]|
+-----------+--------------------+
and I want to convert the column test_123
to be like this:
+-----------+--------------------+
|       uuid|            test_123|
+-----------+--------------------+
|          1|  "test,test2,test3"|
|          2|  "test4,test,test6"|
|          3|  "test6,test9,t55o"|
+-----------+--------------------+
so converting each array of strings to a single comma-separated string. How can I do this with PySpark?
In order to convert an array to a string, PySpark SQL provides a built-in function, concat_ws(), which takes the delimiter of your choice as its first argument and an array column (type Column) as its second argument. To use concat_ws(), you need to import it from pyspark.sql.functions.
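For example, a minimal sketch using the DataFrame from the question (a select() variant; the withColumn() variant is shown in the answer below):

from pyspark.sql.functions import concat_ws, col

# join the elements of the test_123 array with a comma delimiter
df.select("uuid", concat_ws(",", col("test_123")).alias("test_123")).show()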
You can update a PySpark DataFrame column using withColumn(), select(), or sql(). Since DataFrames are distributed, immutable collections, you can't really change column values in place; when you change a value using withColumn() or any other approach, PySpark returns a new DataFrame with the updated values.
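A minimal sketch of that behavior, assuming the df built at the end of this post:

from pyspark.sql.functions import concat_ws

# withColumn() does not modify df in place; it returns a new DataFrame
df2 = df.withColumn("test_123", concat_ws(",", "test_123"))
df.show()   # df still contains the original array column
df2.show()  # df2 contains the joined string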
To convert a plain Python list to a string, use the join() method, optionally with a list comprehension to turn non-string elements into strings first. join() concatenates the list's elements into a single new string and returns it.
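For example, in plain Python:

# join() concatenates the string elements of a list into one string
",".join(["test", "test2", "test3"])   # returns 'test,test2,test3'
",".join([str(x) for x in [1, 2, 3]])  # list comprehension for non-string elements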
While you can use a UserDefinedFunction, it is very inefficient. Instead, it is better to use the concat_ws function:
from pyspark.sql.functions import concat_ws
df.withColumn("test_123", concat_ws(",", "test_123")).show()
+----+----------------+
|uuid| test_123|
+----+----------------+
| 1|test,test2,test3|
| 2|test4,test,test6|
| 3|test6,test9,t55o|
+----+----------------+
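If you prefer SQL syntax, the same transformation can be written as an expression (a small sketch, equivalent to the withColumn() call above):

# concat_ws is also available as a Spark SQL function
df.selectExpr("uuid", "concat_ws(',', test_123) AS test_123").show()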
You can create a udf that joins the array/list and then apply it to the test_123 column:
from pyspark.sql.functions import udf, col
join_udf = udf(lambda x: ",".join(x))
df.withColumn("test_123", join_udf(col("test_123"))).show()
+----+----------------+
|uuid| test_123|
+----+----------------+
| 1|test,test2,test3|
| 2|test4,test,test6|
| 3|test6,test9,t55o|
+----+----------------+
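Note that ",".join(x) raises a TypeError if test_123 is null for some row; a minimal null-safe variant of the same udf could look like this (a sketch, not part of the original answer):

from pyspark.sql.functions import udf, col

# return None instead of failing when the array column is null
join_udf = udf(lambda x: ",".join(x) if x is not None else None)
df.withColumn("test_123", join_udf(col("test_123"))).show()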
The initial DataFrame is created from:
from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, StringType
schema = StructType([StructField("uuid", IntegerType(), True), StructField("test_123", ArrayType(StringType(), True), True)])
rdd = sc.parallelize([[1, ["test","test2","test3"]], [2, ["test4","test","test6"]],[3,["test6","test9","t55o"]]])
df = spark.createDataFrame(rdd, schema)
df.show()
+----+--------------------+
|uuid| test_123|
+----+--------------------+
| 1|[test, test2, test3]|
| 2|[test4, test, test6]|
| 3|[test6, test9, t55o]|
+----+--------------------+
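For reference, the same DataFrame can also be built without going through an RDD, since createDataFrame() accepts a plain Python list of rows (a sketch, assuming an active SparkSession named spark):

from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, StringType

schema = StructType([
    StructField("uuid", IntegerType(), True),
    StructField("test_123", ArrayType(StringType(), True), True),
])
data = [
    (1, ["test", "test2", "test3"]),
    (2, ["test4", "test", "test6"]),
    (3, ["test6", "test9", "t55o"]),
]
df = spark.createDataFrame(data, schema)
df.show()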