 

split a Spark column of Array[String] into columns of String

If I have a DataFrame containing a column of Array[String]:

scala> y.show
+---+----------+
|uid|event_comb|
+---+----------+
|  c|  [xx, zz]|
|  b|  [xx, xx]|
|  b|  [xx, yy]|
|  b|  [xx, zz]|
|  b|  [xx, yy]|
|  b|  [xx, zz]|
|  b|  [yy, zz]|
|  a|  [xx, yy]|
+---+----------+

How can I split the column "event_comb" into two columns (e.g. "event1" and "event2")?

Asked Feb 09 '18 by Jonathan

People also ask

How do I split a string into multiple columns in spark?

PySpark SQL provides the split() function to convert a delimiter-separated String into an Array (StringType to ArrayType) column on a DataFrame. The string column can be split on a delimiter such as a space, comma, or pipe.
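For reference, the same split() function exists in the Scala API (org.apache.spark.sql.functions.split). A minimal sketch, assuming a spark-shell session and a hypothetical comma-separated string column "events_str" (not part of the original question):

// Hypothetical input: a comma-separated string column "events_str"
// Assumes spark-shell, where spark.implicits._ is already imported
import org.apache.spark.sql.functions.split

val strDf = Seq(("a", "xx,yy"), ("b", "xx,zz")).toDF("uid", "events_str")

// split() turns the StringType column into an ArrayType(StringType) column
strDf.withColumn("event_comb", split($"events_str", ",")).show(false)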


1 Answer

If your column type is an Array or a Map, you can use the getItem function to get the value:

getItem(Object key)

An expression that gets an item at position ordinal out of an array, or gets a value by key key in a MapType.

import spark.implicits._  // for toDF and the $ column syntax (already in scope in spark-shell)

val data = Seq(
  ("c", List("xx", "zz")),
  ("b", List("xx", "xx")),
  ("b", List("xx", "yy")),
  ("b", List("xx", "zz")),
  ("b", List("xx", "yy")),
  ("b", List("xx", "zz")),
  ("b", List("yy", "zz")),
  ("a", List("xx", "yy"))
).toDF("uid", "event_comb")

// getItem(0) and getItem(1) pull the first and second array elements into their own columns
data.withColumn("event1", $"event_comb".getItem(0))
    .withColumn("event2", $"event_comb".getItem(1))
    .show(false)

Output:

+---+----------+------+------+
|uid|event_comb|event1|event2|
+---+----------+------+------+
|c  |[xx, zz]  |xx    |zz    |
|b  |[xx, xx]  |xx    |xx    |
|b  |[xx, yy]  |xx    |yy    |
|b  |[xx, zz]  |xx    |zz    |
|b  |[xx, yy]  |xx    |yy    |
|b  |[xx, zz]  |xx    |zz    |
|b  |[yy, zz]  |yy    |zz    |
|a  |[xx, yy]  |xx    |yy    |
+---+----------+------+------+
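If the arrays always have the same known length, the same getItem approach can be generated programmatically rather than written out column by column. A minimal sketch under that assumption, reusing the data DataFrame from above (the length n = 2 matches this example and would need adjusting otherwise):

// Sketch: one column per array position, assuming every array has exactly n elements
val n = 2
val exploded = (0 until n).foldLeft(data) { (df, i) =>
  df.withColumn(s"event${i + 1}", $"event_comb".getItem(i))
}
exploded.show(false)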
Answered Oct 14 '22 by koiralo