I have a dataframe (with more rows and columns) as shown below.
Sample DF:
from pyspark import Row
from pyspark.sql import SQLContext
from pyspark.sql.functions import explode
sqlc = SQLContext(sc)
df = sqlc.createDataFrame([Row(col1 = 'z1', col2 = '[a1, b2, c3]', col3 = 'foo')])
# +------+-------------+------+
# | col1| col2| col3|
# +------+-------------+------+
# | z1| [a1, b2, c3]| foo|
# +------+-------------+------+
df
# DataFrame[col1: string, col2: string, col3: string]
What I want:
+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
| z1| a1| foo|
| z1| b2| foo|
| z1| c3| foo|
+-----+-----+-----+
I tried to replicate the RDD
solution provided here: Pyspark: Split multiple array columns into rows
(df
.rdd
.flatMap(lambda row: [(row.col1, col2, row.col3) for col2 in row.col2)])
.toDF(["col1", "col2", "col3"]))
However, it is not giving the required result
Edit: The explode
option does not work because it is currently stored as string and the explode
function expects an array
You can use explode
but first you'll have to convert the string representation of the array into an array.
One way is to use regexp_replace
to remove the leading and trailing square brackets, followed by split
on ", "
.
from pyspark.sql.functions import col, explode, regexp_replace, split
df.withColumn(
"col2",
explode(split(regexp_replace(col("col2"), "(^\[)|(\]$)", ""), ", "))
).show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#| z1| a1| foo|
#| z1| b2| foo|
#| z1| c3| foo|
#+----+----+----+
Pault's solution should work perfectly fine although here is another solution which uses regexp_extract
instead (you don't really need to replace anything in this case) and it can handle arbitrary number of spaces:
from pyspark.sql.functions import col, explode, regexp_extract,regexp_replace, split
df.withColumn("col2",
explode(
split(
regexp_extract(
regexp_replace(col("col2"), "\s", ""), "^\[(.*)\]$", 1), ","))) \
.show()
Explanation:
regexp_replace(col("col2"), "\s", "")
will replace all spaces with empty string. regexp_extract
will extract the content of the column which start with [
and ends with ]
.split
for the comma separated values and finally explode
.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With