 

How to select a subset of fields from an array column in Spark?

Let's say I have a DataFrame as follows:

case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])

val df = sqlContext.createDataFrame(List(
  MotherClass(Array(
    SubClass("1", 1, "thisIsUseless"),
    SubClass("2", 2, "thisIsUseless"),
    SubClass("3", 3, "thisIsUseless")
  )),
  MotherClass(Array(
    SubClass("4", 4, "thisIsUseless"),
    SubClass("5", 5, "thisIsUseless")
  ))
))

The schema is:

root
 |-- subClasss: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- size: integer (nullable = false)
 |    |    |-- useless: string (nullable = true)

I'm looking for a way to select only a subset of the fields, id and size, from the array column subClasss, while keeping the nested array structure. The resulting schema would be:

root
 |-- subClasss: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- size: integer (nullable = false)

I've tried

df.select("subClasss.id", "subClasss.size")

But this splits the array subClasss into two separate top-level arrays:

root
 |-- id: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- size: array (nullable = true)
 |    |-- element: integer (containsNull = true)

Is there a way to keep the original structure and just eliminate the useless field? Something that would look like:

df.select("subClasss.[id,size]")

Thanks for your time.

asked Apr 07 '16 by jmvllt


People also ask

How do I subset columns in Spark DataFrame?

You can select a single column or multiple columns of a Spark DataFrame by passing the names of the columns you want to the select() function. Since DataFrames are immutable, this creates a new DataFrame containing only the selected columns. The show() function is used to display the DataFrame contents.
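
For instance, a minimal sketch (the people DataFrame, the input file, and its column names are illustrative assumptions, not taken from the question above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
// hypothetical input file with name and age columns
val people = spark.read.json("people.json")
people.select("name", "age").show()  // new DataFrame restricted to these columns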

How do I select columns in Spark dataset?

To select a column from a Dataset, use the apply method in Scala and col in Java. Note that the resulting Column type can also be manipulated through its various functions.
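
A minimal Scala sketch of the apply style (ds and its size column are assumed for illustration):

// apply returns a Column, which supports further expressions
val sizeCol = ds("size")
ds.select(sizeCol * 2).show()  // Column arithmetic via the Column functions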

How do I extract a column in Spark?

In order to convert a Spark DataFrame column to a List, first select() the column you want, next use the Spark map() transformation to convert each Row to a String, and finally collect() the data to the driver, which returns an Array[String].
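
A hedged sketch of that pattern (someDf and its name column are assumptions; spark is the active SparkSession):

import spark.implicits._  // provides the Encoder needed by map

val names: Array[String] = someDf
  .select($"name")        // keep only the wanted column
  .map(_.getString(0))    // Row -> String
  .collect()              // bring the results back to the driver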


1 Answer

Spark >= 2.4:

It is possible to use arrays_zip with cast:

import org.apache.spark.sql.functions.arrays_zip
import spark.implicits._  // for the $"..." column syntax, assuming a SparkSession named spark

df.select(arrays_zip(
  $"subClasss.id", $"subClasss.size"
).cast("array<struct<id:string,size:int>>"))

where the cast is required to rename the nested fields; without it, Spark uses the automatically generated names 0, 1, ..., n.
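
To keep the original column name, the result can be aliased; a quick hedged check (the alias name is chosen to match the question):

df.select(arrays_zip(
  $"subClasss.id", $"subClasss.size"
).cast("array<struct<id:string,size:int>>").alias("subClasss")).printSchema()
// should print an array of struct<id: string, size: int>,
// i.e. the desired schema from the question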

Spark < 2.4:

You can use a UDF like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

case class Record(id: String, size: Int)

// Keep id and size from each struct, dropping the remaining field
val dropUseless = udf((xs: Seq[Row]) => xs.map {
  case Row(id: String, size: Int, _) => Record(id, size)
})

df.select(dropUseless($"subClasss"))
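
As above, the result can be aliased to keep the original column name; a hedged usage check against the example data (the exact formatting of show varies by Spark version):

df.select(dropUseless($"subClasss").alias("subClasss")).show(false)
// expected rows: [[1,1],[2,2],[3,3]] and [[4,4],[5,5]]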
answered Oct 04 '22 by zero323