If I have a DataFrame containing a column of Array[String]:
scala> y.show
+---+----------+
|uid|event_comb|
+---+----------+
| c| [xx, zz]|
| b| [xx, xx]|
| b| [xx, yy]|
| b| [xx, zz]|
| b| [xx, yy]|
| b| [xx, zz]|
| b| [yy, zz]|
| a| [xx, yy]|
+---+----------+
How can I split the column "event_comb" into two columns (e.g. "event1" and "event2")?
Spark SQL provides the split() function to convert a delimiter-separated String into an Array (StringType to ArrayType) column on a DataFrame, splitting the string column on a delimiter such as a space, comma, or pipe. Here, though, the "event_comb" column is already an ArrayType, so no split is needed.
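For reference, a minimal sketch of split() in a spark-shell session (the "events" column and comma delimiter are made up for illustration; toDF and the $ syntax assume spark.implicits._ is in scope, as it is in spark-shell):

import org.apache.spark.sql.functions.split

// Hypothetical string column "events" holding comma-separated values;
// split() turns it into an ArrayType column (the pattern is a regex).
val strDf = Seq(("a", "xx,yy"), ("b", "yy,zz")).toDF("uid", "events")
strDf.withColumn("event_arr", split($"events", ",")).show(false)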
If your column type is an array or a map, you can use the getItem function to get the value:
getItem(key: Any): Column
An expression that gets an item at position ordinal out of an array, or gets a value by key key in a MapType.
// Runs as-is in spark-shell; in a standalone app you would also need
// import spark.implicits._ for toDF and the $ column syntax.
val data = Seq(
  ("c", List("xx", "zz")),
  ("b", List("xx", "xx")),
  ("b", List("xx", "yy")),
  ("b", List("xx", "zz")),
  ("b", List("xx", "yy")),
  ("b", List("xx", "zz")),
  ("b", List("yy", "zz")),
  ("a", List("xx", "yy"))
).toDF("uid", "event_comb")

// getItem(0) and getItem(1) pull the first and second array elements
// out into their own columns.
data.withColumn("event1", $"event_comb".getItem(0))
  .withColumn("event2", $"event_comb".getItem(1))
  .show(false)
Output:
+---+----------+------+------+
|uid|event_comb|event1|event2|
+---+----------+------+------+
|c |[xx, zz] |xx |zz |
|b |[xx, xx] |xx |xx |
|b |[xx, yy] |xx |yy |
|b |[xx, zz] |xx |zz |
|b |[xx, yy] |xx |yy |
|b |[xx, zz] |xx |zz |
|b |[yy, zz] |yy |zz |
|a |[xx, yy] |xx |yy |
+---+----------+------+------+
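As a brief aside, the same extraction can be written with the column's apply syntax (which desugars to getItem) or, on Spark 2.4+, with the built-in element_at function; note that element_at uses 1-based indices. A sketch:

import org.apache.spark.sql.functions.element_at

data.select(
  $"uid",
  $"event_comb",
  $"event_comb"(0).as("event1"),              // apply() is sugar for getItem(0)
  element_at($"event_comb", 2).as("event2")   // element_at indexes from 1
).show(false)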