In PySpark, how to split strings in all columns to a list of string?
a = [('a|q|e', 'd|r|y'), ('j|l|f', 'm|g|j')]
df = spark.createDataFrame(a, ['col1', 'col2'])
df.show()
+-----+-----+
| col1| col2|
+-----+-----+
|a|q|e|d|r|y|
|j|l|f|m|g|j|
+-----+-----+
Output expected:
+---------+---------+
| col1| col2|
+---------+---------+
|[a, q, e]|[d, r, y]|
|[j, l, f]|[m, g, j]|
+---------+---------+
I can split one column at a time using withColumn, but that is not an appealing solution with a dynamic number of columns:
from pyspark.sql.functions import col, split
outDF = df.withColumn("col1", split(col("col1"), "\\|").alias("col1"))
One option is to build a list of column expressions first and then pass it to the select method with varargs syntax:
from pyspark.sql.functions import col, split
cols = ['col1', 'col2'] # columns to split
col_exprs = [split(col(x), "\\|").alias(x) for x in cols]
df.select(*col_exprs).show()
+---------+---------+
| col1| col2|
+---------+---------+
|[a, q, e]|[d, r, y]|
|[j, l, f]|[m, g, j]|
+---------+---------+
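If the columns to split aren't known in advance, the list can also be derived from the DataFrame itself rather than hardcoded. A minimal sketch, assuming every string-typed column should be split on the pipe character (the string_cols name is just illustrative):
from pyspark.sql.functions import col, split

# Assumption: every string-typed column is pipe-delimited and should be split.
string_cols = [c for c, t in df.dtypes if t == 'string']
df.select(*[split(col(c), "\\|").alias(c) for c in string_cols]).show()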
Another option is to use reduce from functools together with withColumn to overwrite the columns dynamically:
from functools import reduce

reduce(
    lambda df, colname: df.withColumn(colname, split(col(colname), "\\|").alias(colname)),
    cols,
    df
).show()
+---------+---------+
| col1| col2|
+---------+---------+
|[a, q, e]|[d, r, y]|
|[j, l, f]|[m, g, j]|
+---------+---------+
The physical plan confirms that the chained withColumn calls collapse into a single Project over the scan:
reduce(
    lambda df, colname: df.withColumn(colname, split(col(colname), "\\|").alias(colname)),
    cols,
    df
).explain()
# == Physical Plan ==
# *Project [split(col1#0, \|) AS col1#76, split(col2#1, \|) AS col2#81]
# +- Scan ExistingRDD[col1#0,col2#1]
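For comparison, the select-based version above should compile to an equivalent plan (a single Project over the scan, with different expression IDs), so the two approaches perform the same; assuming col_exprs is still in scope:
df.select(*col_exprs).explain()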