Convert Dataframe to a Map(Key-Value) in Spark

Question

I have a DataFrame in Spark that looks like this (it has 30 columns; only some of them are shown):

[ABCD,color,NORMAL,N,2015-02-20,1]
[XYZA,color,NORMAL,N,2015-05-04,1]
[GFFD,color,NORMAL,N,2015-07-03,1]
[NAAS,color,NORMAL,N,2015-08-26,1]
[LOWW,color,NORMAL,N,2015-09-26,1]
[KARA,color,NORMAL,N,2015-11-08,1]
[ALEQ,color,NORMAL,N,2015-12-04,1]
[VDDE,size,NORMAL,N,2015-12-23,1]
[QWER,color,NORMAL,N,2016-01-18,1]
[KDSS,color,NORMAL,Y,2015-08-29,1]
[KSDS,color,NORMAL,Y,2015-08-29,1]
[ADSS,color,NORMAL,Y,2015-08-29,1]
[BDSS,runn,NORMAL,Y,2015-08-29,1]
[EDSS,color,NORMAL,Y,2015-08-29,1]

I need to convert this DataFrame into a key-value map in Scala, building each key from some of the DataFrame's columns and assigning each key a unique value, starting at index 0 and running up to the number of distinct keys.

For example, given the data above, I want the output to be a Map (key-value) collection in Scala like this:

    ([ABCD_color_NORMAL_N_1->0]
    [XYZA_color_NORMAL_N_1->1]
    [GFFD_color_NORMAL_N_1->2]
    [NAAS_color_NORMAL_N_1->3]
    [LOWW_color_NORMAL_N_1->4]
    [KARA_color_NORMAL_N_1->5]
    [ALEQ_color_NORMAL_N_1->6]
    [VDDE_size_NORMAL_N_1->7]
    [QWER_color_NORMAL_N_1->8]
    [KDSS_color_NORMAL_Y_1->9]
    [KSDS_color_NORMAL_Y_1->10]
    [ADSS_color_NORMAL_Y_1->11]
    [BDSS_runn_NORMAL_Y_1->12]
    [EDSS_color_NORMAL_Y_1->13]
    )

I'm new to Scala and Spark, and I tried something like this:

    var map: Map[String, Int] = Map()
    var i = 0
    dataframe.foreach(record => {
        // Is there a better way of creating a key?
        val key = record(0) + record(1) + record(2) + record(3)
        var index = i
        map += (key -> index)
        i += 1
    })

But this is not working: the map is still empty after this completes.

Tzach Zohar · Accepted Answer

The main issue in your code is that it tries to modify a variable created on the driver from code that is executed on the workers. In Spark, driver-side variables can be used inside RDD transformations only as read-only values.

Specifically:

The map is created on the driver machine.
The map (with its initial, empty value) is serialized and sent to the worker nodes.
Each node might change its own local copy of the map.
Those local copies are discarded when foreach finishes; nothing is sent back to the driver.
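
The steps above can be illustrated with a small sketch. It assumes a SparkContext is already available; the object name ClosureSketch, its method names, and the sample keys are hypothetical and only serve the illustration. The first method repeats the pattern from the question, the second returns the data through an action instead.

```scala
import org.apache.spark.SparkContext

object ClosureSketch {
  // Anti-pattern: the closure passed to foreach is serialized, so each task
  // works on its own copy of `seen`. What the driver-side copy contains
  // afterwards should not be relied upon.
  def mutateInForeach(sc: SparkContext): Map[String, Int] = {
    var seen = Map.empty[String, Int]
    sc.parallelize(Seq("a", "b", "c")).foreach(k => seen += (k -> 1))
    seen
  }

  // Safer: build the pairs in a transformation and bring them back
  // to the driver explicitly with an action (collectAsMap).
  def collectBack(sc: SparkContext): scala.collection.Map[String, Int] =
    sc.parallelize(Seq("a", "b", "c")).map(k => k -> 1).collectAsMap()
}
```

On a cluster, mutateInForeach would be expected to hand back the untouched empty map, while collectBack returns all three pairs. The Spark documentation describes the outcome of the first pattern as undefined, so local-mode runs may behave differently.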

To fix this, choose a transformation that returns a new RDD (e.g. map) to create the keys, use zipWithIndex to add the running ids, and then use collectAsMap to bring all the data back to the driver as a Map:

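// .rdd is needed on Spark 2.x and later, where DataFrame.map returns a Dataset (no zipWithIndex)
// collectAsMap returns a scala.collection.Map, so the annotation uses that type
val result: scala.collection.Map[String, Long] = dataframe.rdd
  .map(record => Seq(record(0), record(1), record(2), record(3)).mkString)
  .zipWithIndex()
  .collectAsMap()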
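
The question asks for ids that run over the distinct keys, so a variation that drops duplicate keys before numbering them may be closer to what is wanted. The following is a minimal sketch, not a definitive implementation: the object KeyIndexSketch, the method indexDistinctKeys, and the columns parameter are hypothetical names introduced here.

```scala
import org.apache.spark.sql.DataFrame

object KeyIndexSketch {
  // Hypothetical helper: joins the values at the given column positions
  // with "_", removes duplicate keys, and numbers the remaining keys 0..n-1.
  def indexDistinctKeys(df: DataFrame, columns: Seq[Int]): scala.collection.Map[String, Long] =
    df.rdd
      .map(row => columns.map(i => row.get(i)).mkString("_"))
      .distinct()        // shuffles the data, so ids no longer follow the original row order
      .zipWithIndex()    // assigns 0, 1, 2, ... across partitions
      .collectAsMap()    // brings the (key, id) pairs back to the driver
}
```

Because everything is collected on the driver, this only suits cases where the set of distinct keys fits comfortably in driver memory.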

As for the key creation itself: assuming you want to include the first 5 columns and put a separator (_) between them, you can use the following (Row exposes its values through toSeq):

record => record.toSeq.take(5).mkString("_")
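
The expected output in the question skips the date column, so picking columns by position may fit better than taking the first five. Here is a small sketch in plain Scala; KeySketch, makeKey, and the positions parameter are hypothetical names used only for illustration.

```scala
object KeySketch {
  // Hypothetical helper: picks the values at the given positions
  // and joins them with the separator.
  def makeKey(values: Seq[Any], positions: Seq[Int], sep: String = "_"): String =
    positions.map(i => values(i)).mkString(sep)

  def main(args: Array[String]): Unit = {
    val firstRow = Seq[Any]("ABCD", "color", "NORMAL", "N", "2015-02-20", 1)
    // Positions 0 to 3 plus 5, leaving out the date at position 4.
    println(makeKey(firstRow, Seq(0, 1, 2, 3, 5))) // prints ABCD_color_NORMAL_N_1
  }
}
```

Inside the Spark pipeline it could be used as record => KeySketch.makeKey(record.toSeq, Seq(0, 1, 2, 3, 5)).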

Tags: dictionary, scala, apache-spark

Abhinav Bhardwaj

1 Answers

Tzach Zohar
