 

pyspark create dictionary from data in two columns

Tags:

python

pyspark

I have a pyspark dataframe with two columns:

[Row(zip_code='58542', dma='MIN'),
 Row(zip_code='58701', dma='MIN'),
 Row(zip_code='57632', dma='MIN'),
 Row(zip_code='58734', dma='MIN')]

How can I make a key:value pair out of the data inside the columns?

e.g.:

{
 "58542": "MIN",
 "58701": "MIN",
 etc..
}

I would like to avoid using collect for performance reasons. I've tried a few things but can't seem to get just the values.

asked Sep 04 '18 by too_many_questions


3 Answers

You can simply do this:

zip_dma = {row['zip_code']: row['dma'] for row in df.collect()}
print(zip_dma)
#{'58542': 'MIN', '58701': 'MIN', '57632': 'MIN', '58734': 'MIN'}
answered Sep 19 '22 by BICube

You can avoid using a udf here using pyspark.sql.functions.struct and pyspark.sql.functions.to_json (Spark version 2.1 and above):

import pyspark.sql.functions as f
from pyspark.sql import Row

data = [
    Row(zip_code='58542', dma='MIN'),
    Row(zip_code='58701', dma='MIN'),
    Row(zip_code='57632', dma='MIN'),
    Row(zip_code='58734', dma='MIN')
]

df = spark.createDataFrame(data)

df.withColumn("json", f.to_json(f.struct("dma", "zip_code"))).show(truncate=False)
#+---+--------+--------------------------------+
#|dma|zip_code|json                            |
#+---+--------+--------------------------------+
#|MIN|58542   |{"dma":"MIN","zip_code":"58542"}|
#|MIN|58701   |{"dma":"MIN","zip_code":"58701"}|
#|MIN|57632   |{"dma":"MIN","zip_code":"57632"}|
#|MIN|58734   |{"dma":"MIN","zip_code":"58734"}|
#+---+--------+--------------------------------+

If you instead wanted the zip_code to be the key, you can create a MapType directly using pyspark.sql.functions.create_map:

df.withColumn("json", f.create_map(["zip_code", "dma"])).show(truncate=False)
#+---+--------+-----------------+
#|dma|zip_code|json             |
#+---+--------+-----------------+
#|MIN|58542   |Map(58542 -> MIN)|
#|MIN|58701   |Map(58701 -> MIN)|
#|MIN|57632   |Map(57632 -> MIN)|
#|MIN|58734   |Map(58734 -> MIN)|
#+---+--------+-----------------+
answered Sep 18 '22 by pault


There is one more way to convert your DataFrame into a dict: convert it into a key-value pair RDD, since collectAsMap only works on key-value pair RDDs, and a dictionary is itself a collection of key-value pairs.

rows = [
    Row(zip_code='58542', dma='MIN'),
    Row(zip_code='58701', dma='MIN'),
    Row(zip_code='57632', dma='MIN'),
    Row(zip_code='58734', dma='MIN')
]
data = spark.createDataFrame(rows)

>>> data.show()
+---+--------+
|dma|zip_code|
+---+--------+
|MIN|   58542|
|MIN|   58701|
|MIN|   57632|
|MIN|   58734|
+---+--------+

First, convert the DataFrame into an RDD:

newrdd = data.rdd

Since you want zip_code as the key and dma as the value, and the columns come out in the order (dma, zip_code), select element '1' of each row as the key and element '0' as the value:

keypair_rdd = newrdd.map(lambda x : (x[1],x[0]))

Once you have the key-value pair RDD, simply use collectAsMap to convert it into a dictionary:

>>> zip_dma = keypair_rdd.collectAsMap()
>>> print(zip_dma)
{'58542': 'MIN', '57632': 'MIN', '58734': 'MIN', '58701': 'MIN'}

>>> list(zip_dma.keys())
['58542', '57632', '58734', '58701']

Look up the value for a specific key:

>>> zip_dma.get('58542')
'MIN'
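Once the mapping is on the driver, plain Python handles any follow-up transformations. For example, a sketch of inverting the collected dictionary into dma -> list of zip codes (using hard-coded sample data standing in for the collectAsMap result):

```python
from collections import defaultdict

# Stand-in for the dictionary returned by collectAsMap above.
zip_to_dma = {"58542": "MIN", "57632": "MIN", "58734": "MIN", "58701": "MIN"}

# Group zip codes by dma to invert the mapping.
dma_to_zips = defaultdict(list)
for zip_code, dma in zip_to_dma.items():
    dma_to_zips[dma].append(zip_code)

print(dict(dma_to_zips))
```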
answered Sep 18 '22 by vikrant rana