
Spark Scala Dataframe convert a column of Array of Struct to a column of Map

I am new to Scala. I have a DataFrame with the following fields:

ID:string, Time:timestamp, Items:array(struct(name:string,ranking:long))

I want to convert each row of the Items field to a hashmap, with the name as the key. I am not very sure how to do this.

asked Jul 14 '17 by YIWEN GONG

People also ask

How do I extract a column from a DataFrame in Scala?

To convert a Spark DataFrame column to a List, first select() the column you want, then use Spark's map() transformation to convert each Row to a String, and finally collect() the data to the driver, which returns an Array[String].
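
For illustration, a minimal Scala sketch of that select / map / collect sequence (the DataFrame and column names here are only placeholders):

import spark.implicits._

// Hypothetical: pull the "ID" column out of df as a local Array[String]
val ids: Array[String] = df.select("ID")   // keep only the column of interest
  .map(row => row.getString(0))            // convert each Row to a String
  .collect()                               // bring the results back to the driver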

How do you convert struct to map in PySpark?

Solution: PySpark provides a create_map() function that takes a list of columns as arguments and returns a MapType column, so it can be used to convert a DataFrame struct column to a map. A struct column has the StructType data type, while MapType stores dictionary-style key-value pairs.
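
In the Scala API the same idea is exposed as the map() function in org.apache.spark.sql.functions (create_map is the PySpark name). A small sketch, assuming a hypothetical struct column named Item with name and ranking fields:

import org.apache.spark.sql.functions.{col, map}

// Build a one-entry MapType column from two struct fields
val withMap = df.withColumn("ItemMap", map(col("Item.name"), col("Item.ranking")))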

What is StructType and StructField in spark?

StructType and StructField are used to define a schema, or part of one, for a DataFrame. They define the name, data type, and nullable flag for each column. A StructType object is a collection of StructField objects: a built-in data type that holds a list of StructFields.

What is a StructType in Scala?

StructType objects define the schema of Spark DataFrames. StructType objects contain a list of StructField objects that define the name, type, and nullable flag for each column in a DataFrame.
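
As a sketch, the schema from the question could be spelled out with these objects like so:

import org.apache.spark.sql.types._

// Schema matching ID:string, Time:timestamp, Items:array(struct(name:string,ranking:long))
val schema = StructType(Seq(
  StructField("ID", StringType),
  StructField("Time", TimestampType),
  StructField("Items", ArrayType(StructType(Seq(
    StructField("name", StringType),
    StructField("ranking", LongType)
  ))))
))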


1 Answer

This can be done using a UDF:

import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

// Sample data:
val df = Seq(
  ("id1", "t1", Array(("n1", 4L), ("n2", 5L))),
  ("id2", "t2", Array(("n3", 6L), ("n4", 7L)))
).toDF("ID", "Time", "Items")

// Create UDF converting array of (String, Long) structs to Map[String, Long]
val arrayToMap = udf[Map[String, Long], Seq[Row]] {
  array => array.map { case Row(key: String, value: Long) => (key, value) }.toMap
}

// apply UDF
val result = df.withColumn("Items", arrayToMap($"Items"))

result.show(false)
// +---+----+---------------------+
// |ID |Time|Items                |
// +---+----+---------------------+
// |id1|t1  |Map(n1 -> 4, n2 -> 5)|
// |id2|t2  |Map(n3 -> 6, n4 -> 7)|
// +---+----+---------------------+

I can't see a way to do this without a UDF (using Spark's built-in functions only).
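
For later readers: Spark 2.4 and later do ship a built-in for this, map_from_entries, which turns an array of two-field structs into a map. A minimal sketch on the same data, assuming Spark 2.4+:

import org.apache.spark.sql.functions.map_from_entries

// Spark 2.4+ alternative: no UDF required
val result24 = df.withColumn("Items", map_from_entries($"Items"))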

answered Nov 01 '22 by Tzach Zohar