Spark - convert Map to a single-row DataFrame

In my application I need to create a single-row DataFrame from a Map.

So that a Map like

("col1" -> 5, "col2" -> 10, "col3" -> 6)

would be transformed into a DataFrame with a single row, where the map keys become the column names:

col1 | col2 | col3
5    | 10   | 6

In case you are wondering why I want this: I need to save a single document with some statistics to MongoDB, using the MongoSpark connector, which can save DataFrames and RDDs.
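The save step I have in mind is roughly this (a sketch assuming the pre-10.x MongoSpark connector with spark.mongodb.output.uri configured on the SparkSession; the 10.x connector writes via df.write.format("mongodb") instead):

  import com.mongodb.spark.MongoSpark

  // Writes each row of the single-row DataFrame as a MongoDB document
  MongoSpark.save(df)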

asked Jan 03 '23 by Daniil Andreyevich Baunov
1 Answer

I figured that sorting the column names can't hurt, so I've done it here anyway:

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types._

  val map = Map("col1" -> 5, "col2" -> 6, "col3" -> 10)
  // Split the map into sorted, parallel lists of keys and values
  val (keys, values) = map.toList.sortBy(_._1).unzip
  // A single Row holding the values, wrapped in a one-element RDD
  val rows = spark.sparkContext.parallelize(Seq(Row(values: _*)))
  // One IntegerType column per key
  val schema = StructType(keys.map(
    k => StructField(k, IntegerType, nullable = false)))
  val df = spark.createDataFrame(rows, schema)
  df.show()

Gives:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   5|   6|  10|
+----+----+----+

The idea is straightforward: convert the map to a list of tuples, unzip it, turn the keys into a schema and the values into a single-entry RDD of Rows, then build the DataFrame from those two pieces. (The createDataFrame interface is a bit strange there: it accepts java.util.Lists and kitchen sinks, but for some reason doesn't accept the usual Scala List.)
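For completeness, here is a minimal sketch of the java.util.List overload mentioned above, reusing keys, values, and schema from the snippet. It assumes Scala 2.12's JavaConverters; on Scala 2.13 import scala.jdk.CollectionConverters._ instead.

  import scala.collection.JavaConverters._
  import org.apache.spark.sql.Row

  // createDataFrame(java.util.List[Row], StructType) skips the RDD entirely,
  // which is fine for a single row
  val dfFromList = spark.createDataFrame(Seq(Row(values: _*)).asJava, schema)
  dfFromList.show()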

answered Jan 12 '23 by Andrey Tyukin