 

Creating a dictionary type column in dataframe

Consider the following dataframe:

+---+-----------------+
| id|           values|
+---+-----------------+
| 39|a,a,b,b,c,c,c,c,d|
|520|            a,b,c|
|832|              a,a|
+---+-----------------+
I want to convert it into the following DataFrame:

+---+--------------------------------+
| id|                          values|
+---+--------------------------------+
| 39|{"a": 2, "b": 2, "c": 4, "d": 1}|
|520|        {"a": 1, "b": 1, "c": 1}|
|832|                        {"a": 2}|
+---+--------------------------------+

I tried two approaches:

  1. Converting the DataFrame to an RDD, then mapping the values column through a frequency-counter function. But I get errors when converting the RDD back to a DataFrame.

  2. Using a udf to essentially do the same thing as above.
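
For reference, the frequency-counting logic both approaches rely on can be sketched in plain Python with `collections.Counter` (the function name `count_values` is illustrative, not from the original code):

```python
from collections import Counter

def count_values(s):
    # Split the comma-separated string and count each token.
    return dict(Counter(s.split(',')))

count_values("a,a,b,b,c,c,c,c,d")
# → {'a': 2, 'b': 2, 'c': 4, 'd': 1}
```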

The reason I want a dictionary column is so that I can load it as JSON in one of my Python applications.
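
Once each row holds a plain dict, handing it to a Python application as JSON is a one-liner with the standard library (a minimal sketch, independent of Spark):

```python
import json
from collections import Counter

# One row's dict value, built the same way as in the question.
row_value = dict(Counter("a,b,c".split(',')))

payload = json.dumps(row_value)          # serialize for the downstream app
# payload == '{"a": 1, "b": 1, "c": 1}'
assert json.loads(payload) == row_value  # round-trips cleanly
```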

asked Jul 13 '16 by futurenext110



1 Answer

You can do this with a udf that returns a MapType column.

from pyspark.sql.functions import udf
from pyspark.sql.types import MapType, StringType, IntegerType
from collections import Counter

my_udf = udf(lambda s: dict(Counter(s.split(','))),
             MapType(StringType(), IntegerType()))
df = df.withColumn('values', my_udf('values'))
df.collect()

[Row(id=39, values={u'a': 2, u'c': 4, u'b': 2, u'd': 1}),
 Row(id=520, values={u'a': 1, u'c': 1, u'b': 1}),
 Row(id=832, values={u'a': 2})]
answered Sep 28 '22 by dfernig