
map values in a dataframe from a dictionary using pyspark

I want to know how to map values in a specific column in a dataframe.

I have a dataframe which looks like:

df = sc.parallelize([('india','japan'),('usa','uruguay')]).toDF(['col1','col2'])

+-----+-------+
| col1|   col2|
+-----+-------+
|india|  japan|
|  usa|uruguay|
+-----+-------+

I have a dictionary from where I want to map the values.

dicts = sc.parallelize([('india','ind'), ('usa','us'),('japan','jpn'),('uruguay','urg')])

The output I want is:

+-----+-------+--------+--------+
| col1|   col2|col1_map|col2_map|
+-----+-------+--------+--------+
|india|  japan|     ind|     jpn|
|  usa|uruguay|      us|     urg|
+-----+-------+--------+--------+

I have tried using the lookup function, but it doesn't work: it throws error SPARK-5063, because an RDD (here, dicts) cannot be referenced from inside a transformation or UDF. Following is my approach which failed:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def map_val(x):
    return dicts.lookup(x)[0]  # dicts is an RDD; lookup can't run inside a UDF

myfun = udf(map_val, StringType())

df = df.withColumn('col1_map', myfun('col1'))  # fails with SPARK-5063
df = df.withColumn('col2_map', myfun('col2'))  # fails with SPARK-5063
asked May 13 '18 by YOLO
People also ask

How do I make a PySpark DataFrame from the dictionary?

To do this, the spark.createDataFrame() method is used. It takes two arguments, data and columns: data holds the rows of the dataframe, and columns holds the list of column names.

How do I map a column in PySpark?

Solution: The PySpark SQL function create_map() converts selected DataFrame columns to a MapType column. It takes the columns you want to convert (alternating keys and values) as arguments and returns a MapType column.


1 Answer

I think the easiest way is to use a plain Python dictionary together with create_map and df.withColumn:

from itertools import chain
from pyspark.sql.functions import create_map, lit

simple_dict = {'india':'ind', 'usa':'us', 'japan':'jpn', 'uruguay':'urg'}

mapping_expr = create_map([lit(x) for x in chain(*simple_dict.items())])

df = df.withColumn('col1_map', mapping_expr[df['col1']])\
       .withColumn('col2_map', mapping_expr[df['col2']])

df.show(truncate=False)
answered Sep 19 '22 by Ali AzG