I want to be able to update a value in a nested Dataset. For this I have created a nested Dataset in Spark. It has the following schema:
root
|-- field_a: string (nullable = false)
|-- field_b: struct (nullable = true)
| |-- field_d: struct (nullable = false)
| | |-- field_not_to_update: string (nullable = true)
| | |-- field_to_update: string (nullable = false)
|-- field_c: string (nullable = false)
Now I want to update the value of field_to_update in the dataset. I have tried:
aFooData.withColumn("field_b.field_d.field_to_update", lit("updated_val"))
I also tried:
aFooData.foreach(new ClassWithForEachFunction());
where ClassWithForEachFunction implements ForeachFunction<Row> and has a method public void call(Row aRow) that tries to update the field_to_update attribute. I tried the same with a lambda as well, but it threw a Task not serializable exception, so I had to fall back to the longer class-based approach.
Neither attempt has worked so far: after foreach the Dataset is unchanged, and withColumn adds a new top-level column literally named field_b.field_d.field_to_update. Any other suggestions?
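Please check the code below. It flattens the struct into top-level columns, overwrites field_to_update, and then rebuilds field_b with struct.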
scala> df.show(false)
+-------+--------------+
|field_a|field_b       |
+-------+--------------+
|parentA|[srinivas, 20]|
|parentB|[ravi, 30]    |
+-------+--------------+
scala> df.printSchema
root
|-- field_a: string (nullable = true)
|-- field_b: struct (nullable = true)
| |-- field_to_update: string (nullable = true)
| |-- field_not_to_update: integer (nullable = true)
scala> df.select("field_a", "field_b.field_to_update", "field_b.field_not_to_update").
     |   withColumn("field_to_update", lit("updated_val")).
     |   select(col("field_a"), struct(col("field_to_update"), col("field_not_to_update")).as("field_b")).
     |   show(false)
+-------+-----------------+
|field_a|field_b          |
+-------+-----------------+
|parentA|[updated_val, 20]|
|parentB|[updated_val, 30]|
+-------+-----------------+
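The example above uses a two-level schema, while the question has field_to_update one level deeper, under field_b.field_d. The same idea can be sketched for that shape. The first attempt in the question fails because withColumn treats its first argument as a literal top-level column name rather than as a path into the struct, so the struct column itself has to be replaced. The sketch below shows two ways of doing that. The case classes, the object name, and the sample values are hypothetical and exist only to reproduce the schema from the question. The second variant relies on Column.withField, which is only available from Spark 3.1 onwards.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, struct}

object NestedUpdateSketch {
  // Hypothetical case classes that mirror the schema in the question.
  case class FieldD(field_not_to_update: String, field_to_update: String)
  case class FieldB(field_d: FieldD)
  case class Foo(field_a: String, field_b: FieldB, field_c: String)

  // Variant 1: rebuild field_b level by level, copying the fields that
  // must stay the same and substituting a literal for field_to_update.
  def rebuildStruct(df: DataFrame): DataFrame =
    df.withColumn(
      "field_b",
      struct(
        struct(
          col("field_b.field_d.field_not_to_update").as("field_not_to_update"),
          lit("updated_val").as("field_to_update")
        ).as("field_d")
      )
    )

  // Variant 2 (assumes Spark 3.1+): Column.withField replaces a single
  // nested field by its dotted path and keeps the sibling fields.
  def withFieldUpdate(df: DataFrame): DataFrame =
    df.withColumn(
      "field_b",
      col("field_b").withField("field_d.field_to_update", lit("updated_val"))
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("nested-update-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample row with made-up values.
    val aFooData = Seq(Foo("parentA", FieldB(FieldD("keep_me", "old_val")), "c_val")).toDF()

    rebuildStruct(aFooData).show(false)
    withFieldUpdate(aFooData).show(false)

    spark.stop()
  }
}
```

One thing to watch with the first variant: struct(...) always produces a non-null struct, so rows where field_b was null end up with a populated field_b after the update. If that matters, the rebuilt struct can be wrapped in a when(col("field_b").isNotNull, ...) expression.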