
How to modify a Spark Dataframe with a complex nested structure?

I have a DataFrame with a complex structure and would like an easy way to null out a column. I've written implicit classes that add this functionality and handle flat (2D) DataFrame structures well, but once the DataFrame contains ArrayType or MapType columns I haven't had much luck.

For example, I have a schema defined as:

StructType(
    StructField(name,StringType,true), 
    StructField(data,ArrayType(
        StructType(
            StructField(name,StringType,true), 
            StructField(values,
                MapType(StringType,StringType,true),
            true)
        ),
        true
    ),
    true)
)

I'd like to produce a new DataFrame in which the data.values field (the MapType) is set to null, but because it is a field of an array element I haven't been able to figure out how. I expected it to be something like:

df.withColumn("data.values", functions.array(functions.lit(null)))

but this creates a new top-level column named data.values and does not modify the values field inside the elements of the data array.

user2743583 asked Apr 20 '16 04:04


2 Answers

Since Spark 1.6, you can map your DataFrames to case classes (the result is a typed Dataset). You can then map over the data and transform it into the new schema you want. For example:

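// The session's implicits must be in scope to provide the encoders, e.g.
// import spark.implicits._ (Spark 2.x+) or import sqlContext.implicits._ (Spark 1.6).
case class Root(name: String, data: Seq[Data])
case class Data(name: String, values: Map[String, String])
case class NullableRoot(name: String, data: Seq[NullableData])
// NullableData adds a new map field, value, next to the existing values field.
case class NullableData(name: String, value: Map[String, String], values: Map[String, String])

// Convert to a typed Dataset, rebuild each array element with value = null,
// then convert back to a DataFrame.
val nullableDF = df.as[Root].map { root =>
  val nullableData = root.data.map(data => NullableData(data.name, null, data.values))
  NullableRoot(root.name, nullableData)
}.toDF()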

The resulting schema of nullableDF will be:

root
 |-- name: string (nullable = true)
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- values: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
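
If the goal is instead to null the existing values field and keep the schema unchanged, the same Dataset approach can be sketched with copy on the case classes. This is only a minimal sketch: the object name ClearValues and its two methods are made up for illustration, Root and Data mirror the case classes above, and it assumes Spark 2.x or later and that the data array contains no null elements.

```scala
import org.apache.spark.sql.DataFrame

// Same shapes as Root and Data in the answer above.
case class Data(name: String, values: Map[String, String])
case class Root(name: String, data: Seq[Data])

// Hypothetical helper, not part of Spark.
object ClearValues {
  // Pure function: returns a copy of root in which every element's values is null.
  // A null data array is passed through as null; null elements are NOT handled.
  def clear(root: Root): Root =
    root.copy(data = Option(root.data).map(_.map(_.copy(values = null))).orNull)

  // Applies clear to every row; the output schema matches the input schema.
  def apply(df: DataFrame): DataFrame = {
    import df.sparkSession.implicits._ // encoders for Root
    df.as[Root].map(root => clear(root)).toDF()
  }
}
```

Calling ClearValues(df) on a DataFrame with the schema from the question should return a DataFrame with the same schema and data.values set to null in every array element.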

Miguel answered Sep 27 '22 20:09


I ran into the same issue. Assuming you don't need the result to have any new fields or fields with different types, there is a solution that does this without redefining the whole struct; see the question "Change value of nested column in DataFrame".
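
On newer Spark versions, one way to avoid redefining the whole struct is to combine the transform higher-order function (Spark 3.0+) with Column.withField (Spark 3.1+). The sketch below is not necessarily the approach taken in the linked answer; the object name NullNestedField and the helper nullOutValues are made up for illustration, and the column names come from the schema in the question.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit, transform}
import org.apache.spark.sql.types.{MapType, StringType}

// Hypothetical helper, not part of Spark. Assumes Spark 3.1 or later.
object NullNestedField {
  // A typed null, so the field keeps its map<string,string> type.
  val nullMap: Column = lit(null).cast(MapType(StringType, StringType))

  // Rewrites every element of the data array, replacing only its values field.
  // Other fields of the struct (here: name) are carried over unchanged.
  def nullOutValues(df: DataFrame): DataFrame =
    df.withColumn(
      "data",
      transform(col("data"), elem => elem.withField("values", nullMap))
    )
}
```

Because withColumn is given the top-level column name data, the existing column is replaced and no extra data.values column is created.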

Eric Czech answered Sep 27 '22 20:09