Let's say I have a DataFrame as follows:
case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])

val df = sqlContext.createDataFrame(List(
  MotherClass(Array(
    SubClass("1", 1, "thisIsUseless"),
    SubClass("2", 2, "thisIsUseless"),
    SubClass("3", 3, "thisIsUseless")
  )),
  MotherClass(Array(
    SubClass("4", 4, "thisIsUseless"),
    SubClass("5", 5, "thisIsUseless")
  ))
))
The schema is :
root
|-- subClasss: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- size: integer (nullable = false)
| | |-- useless: string (nullable = true)
I'm looking for a way to select only a subset of fields, id and size, of the array column subClasss, while keeping the nested array structure.
The resulting schema would be :
root
|-- subClasss: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- size: integer (nullable = false)
I've tried:
df.select("subClasss.id", "subClasss.size")
But this splits the array subClasss into two separate arrays:
root
|-- id: array (nullable = true)
| |-- element: string (containsNull = true)
|-- size: array (nullable = true)
| |-- element: integer (containsNull = true)
Is there a way to keep the original structure and just eliminate the useless field? Something that would look like:
df.select("subClasss.[id,size]")
Thanks for your time.
Spark >= 2.4:
It is possible to use arrays_zip with cast:
import org.apache.spark.sql.functions.arrays_zip
import spark.implicits._  // for the $"..." column syntax (use sqlContext.implicits._ on older setups)

df.select(arrays_zip(
  $"subClasss.id", $"subClasss.size"
).cast("array<struct<id:string,size:int>>"))
where cast is required to rename the nested fields; without it Spark uses automatically generated names (0, 1, ..., n).
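For example, a minimal sketch (not from the original answer; the trimmed name and the alias back to subClasss are my additions, and it assumes spark.implicits._ is in scope):

val trimmed = df.select(
  arrays_zip($"subClasss.id", $"subClasss.size")
    .cast("array<struct<id:string,size:int>>")
    .as("subClasss")  // keep the original column name
)

trimmed.printSchema()  // the array element should now contain only id and size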
Spark < 2.4:
You can use a UDF like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

case class Record(id: String, size: Int)

// Map each struct to a smaller Record, dropping the useless field
val dropUseless = udf((xs: Seq[Row]) => xs.map {
  case Row(id: String, size: Int, _) => Record(id, size)
})

df.select(dropUseless($"subClasss"))
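As a hedged usage sketch (the alias and the printSchema call are my additions, not part of the original answer), you can restore the original column name and verify the result:

df.select(dropUseless($"subClasss").as("subClasss")).printSchema()
// the array element struct should now expose only id and size

Note that a UDF is opaque to the Catalyst optimizer and converts each row to JVM objects, so it is generally less efficient than the built-in arrays_zip available in Spark >= 2.4.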