
Remove null from array columns in Dataframe in Scala with Spark (1.6)

I have a DataFrame with an "id" column and a column holding an array of structs. The schema:

root
 |-- id: string (nullable = true)
 |-- desc: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- age: long (nullable = false)

The "desc" array can contain any number of null elements. Using Spark 1.6, I would like to produce a final DataFrame in which the array has no null elements.

An example would be:

id   .   desc
1010 .   [[George,21],null,[MARIE,13],null]
1023 .   [null,[Watson,11],[John,35],null,[Kyle,33]]

I want the final dataframe as:

id   .   desc
1010 .   [[George,21],[MARIE,13]]
1023 .   [[Watson,11],[John,35],[Kyle,33]]

I tried doing this with a UDF and a case class, but got:

java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to....

Any help is greatly appreciated. I would prefer to do it without converting to RDDs, if possible.

Asked Mar 03 '26 00:03 by rayban


1 Answer

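Here is another version. Spark hands each struct to a Scala UDF as a Row, not as your case class (hence the ClassCastException), so the UDF takes the array as Seq[Row] and rebuilds the case class for every non-null element: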

case class Person(name: String, age: Int)

root
 |-- id: string (nullable = true)
 |-- desc: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- age: integer (nullable = false)

+----+-----------------------------------------------+
|id  |desc                                           |
+----+-----------------------------------------------+
|1010|[[George,21], null, [MARIE,13], null]          |
|1023|[[Watson,11], null, [John,35], null, [Kyle,33]]|
+----+-----------------------------------------------+


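import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._  // for $"desc"; assumes a SQLContext named sqlContext, as in spark-shell

val filterOutNull = udf((xs: Seq[Row]) => {
  xs.flatMap {
    case null => Nil  // drop null elements
    // convert the Row back to your specific struct:
    case Row(s: String, i: Int) => List(Person(s, i))
  }
})

// adds the cleaned array as a new column next to the original one
val result = df.withColumn("filteredListDesc", filterOutNull($"desc"))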

+----+-----------------------------------------------+-----------------------------------+
|id  |desc                                           |filteredListDesc                   |
+----+-----------------------------------------------+-----------------------------------+
|1010|[[George,21], null, [MARIE,13], null]          |[[George,21], [MARIE,13]]          |
|1023|[[Watson,11], null, [John,35], null, [Kyle,33]]|[[Watson,11], [John,35], [Kyle,33]]|
+----+-----------------------------------------------+-----------------------------------+
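
Note that the schema in this answer has age as an integer, while the question's schema has it as a long. With a long column the pattern Row(s: String, i: Int) should fail with a scala.MatchError, and a type pattern such as s: String also does not match a null name. A minimal sketch of one way around both, reading the fields by position instead of pattern matching, is below. PersonL, dropNullStructs and dropNullStructsUdf are hypothetical names introduced here, and the code assumes the struct fields are ordered (name, age) as in the question's schema.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Hypothetical case class mirroring the question's struct, where age is a long.
case class PersonL(name: String, age: Long)

// Plain function, so the logic can be checked without a SparkContext.
def dropNullStructs(xs: Seq[Row]): Seq[PersonL] =
  xs.flatMap { r =>
    // Option(null) is None, so null elements disappear in the flatMap.
    // getString returns null for a null name; age is non-nullable in the schema.
    Option(r).map(row => PersonL(row.getString(0), row.getLong(1)))
  }

// Wrap the function as a UDF for use on the "desc" column.
val dropNullStructsUdf = udf(dropNullStructs _)
```

With the question's DataFrame as df, something like df.withColumn("desc", dropNullStructsUdf($"desc")) should then replace the column rather than add a second one, since withColumn overwrites an existing column of the same name.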

Answered Mar 04 '26 21:03 by 1pluszara


