 

How to select a subset of fields from an array column in Spark?

Let's say I have a DataFrame as follows:

case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])

val df = sqlContext.createDataFrame(List(
  MotherClass(Array(
    SubClass("1", 1, "thisIsUseless"),
    SubClass("2", 2, "thisIsUseless"),
    SubClass("3", 3, "thisIsUseless")
  )),
  MotherClass(Array(
    SubClass("4", 4, "thisIsUseless"),
    SubClass("5", 5, "thisIsUseless")
  ))
))

The schema is:

root
 |-- subClasss: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- size: integer (nullable = false)
 |    |    |-- useless: string (nullable = true)

I'm looking for a way to select only a subset of the fields, id and size, from the array column subClasss, while keeping the nested array structure. The resulting schema would be:

root
 |-- subClasss: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- size: integer (nullable = false)

I've tried

df.select("subClasss.id", "subClasss.size")

But this splits the array subClasss into two separate top-level arrays:

root
 |-- id: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- size: array (nullable = true)
 |    |-- element: integer (containsNull = true)

Is there a way to keep the original structure and just eliminate the useless field? Something that would look like:

df.select("subClasss.[id,size]")

Thanks for your time.

asked Apr 07 '16 by jmvllt


People also ask

How do I subset columns in Spark DataFrame?

You can select a single column or multiple columns of a Spark DataFrame by passing the names of the columns you want to the select() function. Since DataFrames are immutable, this creates a new DataFrame containing only the selected columns. The show() function is used to display the DataFrame contents.
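
For instance, a minimal sketch (the people DataFrame, the input file, and its column names are illustrative assumptions, not taken from the question above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
// hypothetical input file with name and age columns
val people = spark.read.json("people.json")
people.select("name", "age").show()  // new DataFrame restricted to these columns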

How do I select columns in Spark dataset?

To select a column from a Dataset, use the apply method in Scala and col in Java. Note that the resulting Column type can also be manipulated through its various functions.
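
A minimal Scala sketch of the apply style (ds and its size column are assumed for illustration):

// apply returns a Column, which supports further expressions
val sizeCol = ds("size")
ds.select(sizeCol * 2).show()  // Column arithmetic via the Column functions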

How do I extract a column in Spark?

In order to convert a Spark DataFrame column to a List, first select() the column you want, next use the Spark map() transformation to convert each Row to a String, and finally collect() the data to the driver, which returns an Array[String].
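
A hedged sketch of that pattern (someDf and its name column are assumptions; spark is the active SparkSession):

import spark.implicits._  // provides the Encoder needed by map

val names: Array[String] = someDf
  .select($"name")        // keep only the wanted column
  .map(_.getString(0))    // Row -> String
  .collect()              // bring the results back to the driver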


1 Answer

Spark >= 2.4:

It is possible to use arrays_zip with cast:

import org.apache.spark.sql.functions.arrays_zip
import spark.implicits._  // for the $"..." column syntax, assuming a SparkSession named spark

df.select(arrays_zip(
  $"subClasss.id", $"subClasss.size"
).cast("array<struct<id:string,size:int>>"))

where the cast is required to rename the nested fields; without it, Spark uses the automatically generated names 0, 1, ..., n.
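
To keep the original column name, the result can be aliased; a quick hedged check (the alias name is chosen to match the question):

df.select(arrays_zip(
  $"subClasss.id", $"subClasss.size"
).cast("array<struct<id:string,size:int>>").alias("subClasss")).printSchema()
// should print an array of struct<id: string, size: int>,
// i.e. the desired schema from the question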

Spark < 2.4:

You can use a UDF like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

case class Record(id: String, size: Int)

// Keep id and size from each struct, dropping the remaining field
val dropUseless = udf((xs: Seq[Row]) => xs.map {
  case Row(id: String, size: Int, _) => Record(id, size)
})

df.select(dropUseless($"subClasss"))
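
As above, the result can be aliased to keep the original column name; a hedged usage check against the example data (the exact formatting of show varies by Spark version):

df.select(dropUseless($"subClasss").alias("subClasss")).show(false)
// expected rows: [[1,1],[2,2],[3,3]] and [[4,4],[5,5]]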
answered Oct 04 '22 by zero323