
Split Spark data frame of string column into multiple boolean columns

Tags:

pyspark

We have a spark data frame that looks like this:

   id | value 
------+--------
    0 | A,B
    1 | A,C
    2 | B

We would like to transform it into:

   id | A     | B     | C
------+-------+-------+-------+
    0 | True  | True  | False | 
    1 | True  | False | True  | 
    2 | False | True  | False | 

What's the best way to do such a transformation?

Yuchen asked Feb 17 '26 19:02


2 Answers

Let's assume this is your input data frame (with value already split into an array; if it starts as a string like "A,B", apply split first):

from pyspark.sql.functions import col, explode

df = spark.createDataFrame([(0, ["A", "B"]), (1, ["A", "C"]), (2, ["B"])], ["id", "value"])

Then, using explode and pivot, you obtain a table of counts and nulls:

df2 = df.withColumn("x",explode(df.value)).drop("value").groupBy("id").pivot("x").count()
df2.show()

+---+----+----+----+
| id|   A|   B|   C|
+---+----+----+----+
|  0|   1|   1|null|
|  1|   1|null|   1|
|  2|null|   1|null|
+---+----+----+----+

In the end, you just need to convert the values to booleans, for example by checking for non-null:

for col_name in df2.columns[1:]:
    df2 = df2.withColumn(col_name, col(col_name).isNotNull())

df2.show()

+---+-----+-----+-----+
| id|    A|    B|    C|
+---+-----+-----+-----+
|  0| true| true|false|
|  1| true|false| true|
|  2|false| true|false|
+---+-----+-----+-----+
michalrudko answered Feb 21 '26 13:02


Here is one approach with Scala:

import org.apache.spark.sql.functions.{array_contains, explode, split}
import spark.implicits._

val df = Seq(
  (0, "A,B"),
  (1, "A,C"),
  (2, "B")
).toDF("id", "value")

// store the array produced by split
val withArrayDF = df.withColumn("array", split($"value", ",")).drop("value")

// get the sorted unique values for the whole dataset
val distinctValues = withArrayDF.select(explode($"array")).distinct.collect.map(_.getString(0)).sorted.toList

// for each of A, B, C create a new column named ncol:
// true when ncol is present in the row's array, false otherwise
distinctValues.map { ncol =>
  withArrayDF.withColumn(ncol, array_contains($"array", ncol)).drop("array")
}.reduce(_.join(_, "id")) // join the per-value data frames on id
 .select("id", distinctValues: _*)
 .show

And the output:

+---+-----+-----+-----+ 
| id|    A|    B|    C| 
+---+-----+-----+-----+ 
|  0| true| true|false| 
|  1| true|false| true| 
|  2|false| true|false| 
+---+-----+-----+-----+ 

And the Python version:

from functools import reduce

from pyspark.sql.functions import array_contains, col, explode, split

df = spark.createDataFrame(
    [(0, "A,B"),
     (1, "A,C"),
     (2, "B")], ["id", "value"])

# store the array produced by split
withArrayDF = df.withColumn("array", split(df["value"], ",")).drop("value")

# get the sorted unique values for the whole dataset
distinctValues = sorted(row[0] for row in withArrayDF.select(explode("array")).distinct().collect())

# for each of A, B, C create a new column named ncol:
# true when ncol is present in the row's array, false otherwise
mappedDFs = [withArrayDF.withColumn(ncol, array_contains(col("array"), ncol)).drop("array")
             for ncol in distinctValues]

# join the per-value data frames on id
finalDF = reduce(lambda x, y: x.join(y, "id"), mappedDFs)
finalDF.show()

Output:

+---+-----+-----+-----+ 
| id|    A|    B|    C| 
+---+-----+-----+-----+ 
|  0| true| true|false| 
|  1| true|false| true| 
|  2|false| true|false| 
+---+-----+-----+-----+ 
abiratsis answered Feb 21 '26 15:02


