
Converting a Spark Dataframe to a Scala Map collection

I'm trying to find the best solution to convert an entire Spark DataFrame to a Scala Map collection. It is best illustrated as follows:

To go from this (in the Spark examples):

val df = sqlContext.read.json("examples/src/main/resources/people.json")

df.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

To a Scala collection (Map of Maps) represented like this:

val people = Map(
Map("age" -> null, "name" -> "Michael"),
Map("age" -> 30, "name" -> "Andy"),
Map("age" -> 19, "name" -> "Justin")
)
asked Apr 27 '16 by Jimmy Hendricks


People also ask

How do you convert a DataFrame to a list in Scala spark?

To convert a Spark DataFrame column to a List, first select() the column you want, then use the map() transformation to convert each Row to a String, and finally collect() the data to the driver, which returns an Array[String].
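
A minimal Scala sketch of that recipe, assuming the df built from people.json in the question above:

val names: Array[String] = df.select("name").rdd.map(_.getString(0)).collect()
// names: Array[String] = Array(Michael, Andy, Justin)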

Which method can be used to convert a spark Dataset to a DataFrame?

Converting a Spark RDD to a DataFrame can be done with toDF(), with createDataFrame(), or by transforming an RDD[Row] together with a schema into a DataFrame.
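
A sketch of both routes, assuming a SparkSession available as spark (older versions expose the same calls through sqlContext):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import spark.implicits._   // needed for toDF()

val rdd = spark.sparkContext.parallelize(Seq((30, "Andy"), (19, "Justin")))

// Route 1: toDF() directly on an RDD of tuples (or case classes)
val df1 = rdd.toDF("age", "name")

// Route 2: createDataFrame() on an RDD[Row] plus an explicit schema
val rowRdd = rdd.map { case (age, name) => Row(age, name) }
val schema = StructType(Seq(
  StructField("age", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true)))
val df2 = spark.createDataFrame(rowRdd, schema)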

What is map transformation in spark?

Spark map transformation: map is a transformation operation in Apache Spark. It applies a function to each element of an RDD and returns the result as a new RDD. In the map operation the developer can define custom logic, and the same logic is applied to every element of the RDD.
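
A small sketch, assuming a SparkContext is available as sc:

val numbers = sc.parallelize(Seq(1, 2, 3, 4))
val doubled = numbers.map(_ * 2)   // lazy transformation: builds a new RDD
doubled.collect()                  // action: Array(2, 4, 6, 8)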

How do you convert a DataFrame to a key value pair in PySpark?

Method 1: using df.toPandas(). Convert the PySpark DataFrame to a pandas DataFrame with df.toPandas(), which returns a pandas DataFrame with the same content as the PySpark DataFrame. Then go through each column and add its list of values to a dictionary, using the column name as the key.




2 Answers

I don't think your question makes sense as written -- in your outermost Map you are only stuffing in values, but a Map needs key/value pairs at its outermost level. That being said:

val peopleArray = df.collect.map(r => Map(df.columns.zip(r.toSeq):_*))

Will give you:

Array(
  Map("age" -> null, "name" -> "Michael"),
  Map("age" -> 30, "name" -> "Andy"),
  Map("age" -> 19, "name" -> "Justin")
)

At that point you could do:

val people = Map(peopleArray.map(p => (p.getOrElse("name", null), p)):_*)

Which would give you:

Map(
  ("Michael" -> Map("age" -> null, "name" -> "Michael")),
  ("Andy" -> Map("age" -> 30, "name" -> "Andy")),
  ("Justin" -> Map("age" -> 19, "name" -> "Justin"))
)

I'm guessing this is really more what you want. If you wanted to key them on an arbitrary integer index instead, you can do:

val indexedPeople = Map(peopleArray.zipWithIndex.map(r => (r._2, r._1)):_*)

Which gives you:

Map(
  (0 -> Map("age" -> null, "name" -> "Michael")),
  (1 -> Map("age" -> 30, "name" -> "Andy")),
  (2 -> Map("age" -> 19, "name" -> "Justin"))
)
answered Sep 19 '22 by David Griffin


First, get the schema from the DataFrame:

val schemaList = dataframe.schema.map(_.name).zipWithIndex  // (column name, index) pairs from the DataFrame schema

Then get the RDD from the DataFrame and map over it:

dataframe.rdd.map(row =>
  // here rec._1 is the column name and rec._2 is its index
  schemaList.map(rec => (rec._1, row(rec._2))).toMap
).collect.foreach(println)
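
Against the people.json example from the question, this prints one Map per row, roughly:

Map(age -> null, name -> Michael)
Map(age -> 30, name -> Andy)
Map(age -> 19, name -> Justin)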
answered Sep 22 '22 by Gabber