Remove field from array.struct in Spark

I want to delete one field from an array of structs, as follows:

import scala.collection.mutable
import org.apache.spark.sql.functions
import spark.implicits._  // assumes a SparkSession named `spark` is in scope

case class myObj(id: String, item_value: String, delete: String)
case class myObj2(id: String, item_value: String)

val df2 = Seq(
  ("1", "2", "..100values", Seq(myObj("A", "1a", "1"), myObj("B", "4r", "2"))),
  ("1", "2", "..100values", Seq(myObj("X", "1p", "11"), myObj("V", "7w", "8")))
).toDF("1", "2", "100fields", "myArr")


val deleteColumn: (mutable.WrappedArray[myObj] => mutable.WrappedArray[myObj2]) = {
  (array: mutable.WrappedArray[myObj]) => array.map(o => myObj2(o.id, o.item_value))
}
val myUDF3 = functions.udf(deleteColumn)

df2.withColumn("newArr", myUDF3($"myArr")).show(false)

The error is clear:

Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (array<struct<id:string,item_value:string,delete:string>>) => array<struct<id:string,item_value:string>>)

The types do not match, but that is exactly what I want to do: map from one structure to another.

I am using a UDF because df.map() is not well suited to transforming a single column; it forces me to specify all the columns. I have not found a better way to apply this mapping to just one column.
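
For reference, this is roughly the schema that df2 ends up with (nullability flags may vary):

df2.printSchema()
// root
//  |-- 1: string (nullable = true)
//  |-- 2: string (nullable = true)
//  |-- 100fields: string (nullable = true)
//  |-- myArr: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- id: string (nullable = true)
//  |    |    |-- item_value: string (nullable = true)
//  |    |    |-- delete: string (nullable = true)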

Asked by MrElephant

2 Answers

When a UDF receives an array of structs, Spark passes the elements as Row objects rather than as instances of your case class, so the UDF cannot be typed as mutable.WrappedArray[myObj]. Rewrite the UDF to take a Seq[Row] instead:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val deleteColumn = udf((value: Seq[Row]) => {
  value.map(row => myObj2(row.getString(0), row.getString(1)))
})

df2.withColumn("newArr", deleteColumn($"myArr")).show(false)

Output:

+---+---+-----------+---------------------+----------------+
|1  |2  |100fields  |myArr                |newArr          |
+---+---+-----------+---------------------+----------------+
|1  |2  |..100values|[[A,1a,1], [B,4r,2]] |[[A,1a], [B,4r]]|
|1  |2  |..100values|[[X,1p,11], [V,7w,8]]|[[X,1p], [V,7w]]|
+---+---+-----------+---------------------+----------------+
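
If you would rather not depend on field positions, the same UDF can look fields up by name with getAs. A minimal sketch, assuming the struct fields are named id and item_value as in the question:

// Variant of the UDF above that accesses struct fields by name instead of by
// position, which is more robust if the struct layout ever changes.
val deleteColumnByName = udf((value: Seq[Row]) =>
  value.map(row => myObj2(row.getAs[String]("id"), row.getAs[String]("item_value")))
)

df2.withColumn("newArr", deleteColumnByName($"myArr")).show(false)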
Answered by koiralo


Without using a UDF, you can easily remove fields from an array of structs with dropFields combined with transform (this requires Spark 3.1+, where Column.dropFields was introduced).

Test input:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.createDataFrame(Seq(("v1", "v2", "v3", "v4"))).toDF("f1", "f2", "f3", "f4")
    .select(
        array(
            struct("f1", "f2"),
            struct(col("f3").as("f1"), col("f4").as("f2"))
        ).as("myArr")
    )
df.printSchema()
// root
//  |-- myArr: array (nullable = false)
//  |    |-- element: struct (containsNull = false)
//  |    |    |-- f1: string (nullable = true)
//  |    |    |-- f2: string (nullable = true)

Script:

val df2 = df.withColumn(
    "myArr",
    transform(
        $"myArr",
        x => x.dropFields("f2")
    )
)
df2.printSchema()
// root
//  |-- myArr: array (nullable = false)
//  |    |-- element: struct (containsNull = false)
//  |    |    |-- f1: string (nullable = true)
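
dropFields is only available from Spark 3.1 onwards. On older versions (Spark 2.4+), a similar result can be obtained with the SQL transform higher-order function via expr, rebuilding each struct with only the fields you want to keep. A minimal sketch under that assumption, using the field names from the example above:

// Spark 2.4+ alternative: rebuild each struct element, keeping only f1.
val df2Old = df.withColumn(
    "myArr",
    expr("transform(myArr, x -> struct(x.f1 as f1))")
)
// df2Old.myArr now contains structs with only the f1 field.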
Answered by ZygD


