
Converting a Spark Dataframe to a Scala Map collection

I'm trying to find the best solution to convert an entire Spark DataFrame to a Scala Map collection. It is best illustrated as follows:

To go from this (in the Spark examples):

val df = sqlContext.read.json("examples/src/main/resources/people.json")

df.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

To a Scala collection (Map of Maps) represented like this:

val people = Map(
Map("age" -> null, "name" -> "Michael"),
Map("age" -> 30, "name" -> "Andy"),
Map("age" -> 19, "name" -> "Justin")
)
asked Apr 27 '16 by Jimmy Hendricks


People also ask

How do you convert a DataFrame to a list in Scala spark?

To convert a Spark DataFrame column to a List, first select() the column you want, then use the map() transformation to convert each Row to a String, and finally collect() the data to the driver, which returns an Array[String].
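
A minimal Scala sketch of that recipe, assuming the df built from people.json in the question above:

val names: Array[String] = df.select("name").rdd.map(_.getString(0)).collect()
// names: Array[String] = Array(Michael, Andy, Justin)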

Which method can be used to convert a spark Dataset to a DataFrame?

Converting a Spark RDD to a DataFrame can be done with toDF(), with createDataFrame(), or by transforming an RDD[Row] together with a schema into a DataFrame.
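
A sketch of both routes, assuming a SparkSession available as spark (older versions expose the same calls through sqlContext):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import spark.implicits._   // needed for toDF()

val rdd = spark.sparkContext.parallelize(Seq((30, "Andy"), (19, "Justin")))

// Route 1: toDF() directly on an RDD of tuples (or case classes)
val df1 = rdd.toDF("age", "name")

// Route 2: createDataFrame() on an RDD[Row] plus an explicit schema
val rowRdd = rdd.map { case (age, name) => Row(age, name) }
val schema = StructType(Seq(
  StructField("age", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true)))
val df2 = spark.createDataFrame(rowRdd, schema)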

What is map transformation in spark?

Spark map transformation: map is a transformation operation in Apache Spark. It applies a function to each element of an RDD and returns the result as a new RDD. In the map operation the developer can define custom logic, and the same logic is applied to every element of the RDD.
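
A small sketch, assuming a SparkContext is available as sc:

val numbers = sc.parallelize(Seq(1, 2, 3, 4))
val doubled = numbers.map(_ * 2)   // lazy transformation: builds a new RDD
doubled.collect()                  // action: Array(2, 4, 6, 8)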

How do you convert a DataFrame to a key value pair in PySpark?

Method 1: using df.toPandas(). Convert the PySpark DataFrame to a pandas DataFrame with df.toPandas(), which returns a pandas DataFrame with the same content as the PySpark DataFrame. Then go through each column and add its list of values to a dictionary, using the column name as the key.




2 Answers

I don't think your question makes sense as written -- in your outermost Map you are only stuffing in values, but a Map needs key/value pairs at its outermost level. That being said:

val peopleArray = df.collect.map(r => Map(df.columns.zip(r.toSeq):_*))

Will give you:

Array(
  Map("age" -> null, "name" -> "Michael"),
  Map("age" -> 30, "name" -> "Andy"),
  Map("age" -> 19, "name" -> "Justin")
)

At that point you could do:

val people = Map(peopleArray.map(p => (p.getOrElse("name", null), p)):_*)

Which would give you:

Map(
  ("Michael" -> Map("age" -> null, "name" -> "Michael")),
  ("Andy" -> Map("age" -> 30, "name" -> "Andy")),
  ("Justin" -> Map("age" -> 19, "name" -> "Justin"))
)

I'm guessing this is really more what you want. If you wanted to key them on an arbitrary integer index instead, you can do:

val indexedPeople = Map(peopleArray.zipWithIndex.map(r => (r._2, r._1)):_*)

Which gives you:

Map(
  (0 -> Map("age" -> null, "name" -> "Michael")),
  (1 -> Map("age" -> 30, "name" -> "Andy")),
  (2 -> Map("age" -> 19, "name" -> "Justin"))
)
answered Sep 19 '22 by David Griffin


First, get the schema from the DataFrame:

val schemaList = dataframe.schema.map(_.name).zipWithIndex  // (column name, index) pairs from the DataFrame schema

Then get the RDD from the DataFrame and map over it:

dataframe.rdd.map(row =>
  // here rec._1 is the column name and rec._2 is its index
  schemaList.map(rec => (rec._1, row(rec._2))).toMap
).collect.foreach(println)
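
Against the people.json example from the question, this prints one Map per row, roughly:

Map(age -> null, name -> Michael)
Map(age -> 30, name -> Andy)
Map(age -> 19, name -> Justin)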
answered Sep 22 '22 by Gabber