In my application I need to create a single-row DataFrame from a Map.
So that a Map like
Map("col1" -> 5, "col2" -> 10, "col3" -> 6)
would be transformed into a DataFrame with a single row, and the map keys would become the column names.
col1 | col2 | col3
5 | 10 | 6
In case you are wondering why I would want this: I just need to save a single document with some statistics to MongoDB using the MongoSpark connector, which allows saving DataFrames and RDDs.
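The save itself should then be a one-liner. A minimal sketch, assuming the 2.x mongo-spark-connector is on the classpath and spark.mongodb.output.uri points at the target collection (both assumptions, not shown here), where df is the single-row DataFrame I am trying to build:

import com.mongodb.spark.MongoSpark

// Writes the single-row DataFrame as one document to the collection
// configured via spark.mongodb.output.uri.
MongoSpark.save(df)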
I thought sorting the column names wouldn't hurt anyway.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val map = Map("col1" -> 5, "col2" -> 6, "col3" -> 10)

// Sort by key for a deterministic column order, then split into keys and values.
val (keys, values) = map.toList.sortBy(_._1).unzip
// A single-entry RDD holding the one Row of values.
val rows = spark.sparkContext.parallelize(Seq(Row(values: _*)))
// One non-nullable IntegerType column per key.
val schema = StructType(keys.map(k => StructField(k, IntegerType, nullable = false)))

val df = spark.createDataFrame(rows, schema)
df.show()
Gives:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 5| 6| 10|
+----+----+----+
The idea is straightforward: convert the map to a list of tuples, unzip it, turn the keys into a schema and the values into a single-entry RDD of Rows, and build the DataFrame from those two pieces. (The createDataFrame interface is a bit strange there: it accepts java.util.Lists and kitchen sinks, but for some reason not a plain Scala List.)
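If you'd rather skip the RDD entirely, the java.util.List overload mentioned above also works. A minimal sketch, reusing the values and schema from the answer and just converting the Scala Seq with asJava:

import scala.collection.JavaConverters._
import org.apache.spark.sql.Row

// createDataFrame(java.util.List[Row], StructType): same result, no RDD.
val df2 = spark.createDataFrame(Seq(Row(values: _*)).asJava, schema)
df2.show()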