How to modify a Spark DataFrame with a complex nested structure?

I have a complex DataFrame structure and would like to null out a column easily. I've created implicit classes that wire in functionality and easily handle flat (2D) DataFrame structures, but once the DataFrame becomes more complicated, with ArrayType or MapType, I've not had much luck. For example, I have a schema defined as:

StructType(
    StructField(name,StringType,true), 
    StructField(data,ArrayType(
        StructType(
            StructField(name,StringType,true), 
            StructField(values,
                MapType(StringType,StringType,true),
            true)
        ),
        true
    ),
    true)
)

I'd like to produce a new DataFrame in which the data.values field (the MapType) is set to null, but because it belongs to an element of an array I have not been able to figure out how. I thought it would be something like:

df.withColumn("data.values", functions.array(functions.lit(null)))

but this creates a new top-level column named data.values and does not modify the values field of the elements in the data array.

asked Apr 20 '16 04:04

user2743583

2 Answers

Since Spark 1.6, you can map a DataFrame onto case classes, which gives you a typed Dataset. You can then map over the data and transform it into the schema you want. For example:

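// Needs the session's implicits in scope for the encoders, e.g.
// `import spark.implicits._` (`import sqlContext.implicits._` on Spark 1.6).

// Input shape, matching the schema in the question.
case class Root(name: String, data: Seq[Data])
case class Data(name: String, values: Map[String, String])
// Output shape: adds a nullable `value` map next to the existing `values`.
case class NullableRoot(name: String, data: Seq[NullableData])
case class NullableData(name: String, value: Map[String, String], values: Map[String, String])

val nullableDF = df.as[Root].map { root =>
  // Rebuild each array element with `value` set to null and `values` carried over.
  val nullableData = root.data.map(data => NullableData(data.name, null, data.values))
  NullableRoot(root.name, nullableData)
}.toDF()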

The resulting schema of nullableDF will be:

root
 |-- name: string (nullable = true)
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- values: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
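If the goal is to null the existing values field rather than add a new one, the same Dataset approach can reuse the input case classes. The following is a minimal sketch and not part of the original answer: it assumes Spark 2.x or later (for SparkSession), and the local session and the sample row are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Case classes mirroring the schema from the question.
case class Data(name: String, values: Map[String, String])
case class Root(name: String, data: Seq[Data])

// Assumption: a throwaway local session, only for this sketch.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("null-nested-values-dataset")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample row shaped like the question's schema.
val df = Seq(
  Root("a", Seq(Data("x", Map("k" -> "v")), Data("y", Map("k2" -> "v2"))))
).toDF()

// Keep every field as-is except `values`, which becomes null in each element.
val nulledDF = df.as[Root].map { root =>
  root.copy(data = root.data.map(_.copy(values = null)))
}.toDF()

nulledDF.printSchema()
```

The field names and types are unchanged, so no second set of case classes is needed. Note that this sketch does not guard against a null data array or null array elements; real data with such rows would need an explicit check before calling map or copy.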

answered Sep 27 '22 20:09

Miguel

I ran into the same issue. Assuming you don't need the result to have any new fields or fields with different types, here is a solution that does this without having to redefine the whole struct: "Change value of nested column in DataFrame".
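On newer Spark versions, a column-only alternative may be simpler. This sketch is not necessarily the approach taken in the linked answer; it assumes Spark 3.1 or later, where functions.transform and Column.withField are available. The Record and Entry case classes and the sample row are hypothetical and exist only to build a DataFrame with the question's shape.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, transform}
import org.apache.spark.sql.types.{MapType, StringType}

// Hypothetical case classes, used only to create sample data.
case class Entry(name: String, values: Map[String, String])
case class Record(name: String, data: Seq[Entry])

// Assumption: a throwaway local session, only for this sketch.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("null-nested-values-columns")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  Record("a", Seq(Entry("x", Map("k" -> "v")), Entry("y", Map("k2" -> "v2"))))
).toDF()

// A typed null, so `values` keeps its map<string,string> type.
val nullMap = lit(null).cast(MapType(StringType, StringType))

// Rewrite each element of the `data` array, replacing only its `values` field.
val nulledDF = df.withColumn(
  "data",
  transform(col("data"), elem => elem.withField("values", nullMap))
)

nulledDF.printSchema()
```

Because the lambda runs once per array element, the whole data column is rebuilt under its existing name, which avoids the dotted name data.values that withColumn would treat as a new top-level column.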

answered Sep 27 '22 20:09

Eric Czech

Tags: scala, apache-spark, apache-spark-sql, spark-dataframe
