I'm trying to find the best way to convert an entire Spark DataFrame to a Scala Map collection. It is best illustrated as follows:
To go from this (in the Spark examples):
val df = sqlContext.read.json("examples/src/main/resources/people.json")
df.show
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
To a Scala collection (Map of Maps) represented like this:
val people = Map(
Map("age" -> null, "name" -> "Michael"),
Map("age" -> 30, "name" -> "Andy"),
Map("age" -> 19, "name" -> "Justin")
)
I don't think your question makes sense -- with your outermost Map, I only see you trying to stuff values into it; you need key/value pairs in your outermost Map. That being said:
// Collect the rows to the driver and zip the column names with each row's values
val peopleArray = df.collect.map(r => Map(df.columns.zip(r.toSeq):_*))
Will give you:
Array(
Map("age" -> null, "name" -> "Michael"),
Map("age" -> 30, "name" -> "Andy"),
Map("age" -> 19, "name" -> "Justin")
)
At that point you could do:
// Key each row's Map by its "name" value (falling back to null if the column is missing)
val people = Map(peopleArray.map(p => (p.getOrElse("name", null), p)):_*)
Which would give you:
Map(
("Michael" -> Map("age" -> null, "name" -> "Michael")),
("Andy" -> Map("age" -> 30, "name" -> "Andy")),
("Justin" -> Map("age" -> 19, "name" -> "Justin"))
)
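At that point lookups are just ordinary Map access, for example (the values shown assume the sample data above):
people("Andy")("age")    // 30 (typed as Any, since the inner Maps are Map[String, Any])
people("Michael")("age") // null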
I'm guessing this is really more what you want. If you wanted to key them on an arbitrary integer index instead, you can do:
val indexedPeople = Map(peopleArray.zipWithIndex.map(r => (r._2, r._1)):_*)
Which gives you:
Map(
(0 -> Map("age" -> null, "name" -> "Michael")),
(1 -> Map("age" -> 30, "name" -> "Andy")),
(2 -> Map("age" -> 19, "name" -> "Justin"))
)
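If you would rather avoid the untyped Map[String, Any] values, a typed variant along these lines should also work (just a sketch, assuming a SparkSession named spark and a nullable age column):
case class Person(age: Option[Long], name: String)

import spark.implicits._
val peopleByName: Map[String, Person] =
  df.as[Person].collect().map(p => p.name -> p).toMap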
First, get the schema from the DataFrame:
val schemaList = dataframe.schema.map(_.name).zipWithIndex // column names paired with their index
Then get the RDD from the DataFrame and map over it:
dataframe.rdd.map(row =>
  // rec._1 is the column name and rec._2 its index
  schemaList.map(rec => (rec._1, row(rec._2))).toMap
).collect.foreach(println)
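If you want the result back on the driver as the keyed collection from the question rather than just printed, the same per-row Map can be collected and keyed, for example by name (a sketch, assuming a unique, non-null name column):
val peopleByName = dataframe.rdd.map { row =>
  val rowMap = schemaList.map(rec => (rec._1, row(rec._2))).toMap
  (rowMap("name"), rowMap)
}.collect.toMap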