I have a PySpark DataFrame and I need to convert it into a Python dictionary.
The code below is reproducible:
from pyspark.sql import Row
rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Alice', age=5, height=80),Row(name='Alice', age=10, height=80)])
df = rdd.toDF()
Once I have this DataFrame, I need to convert it into a dictionary.
I tried this:
df.set_index('name').to_dict()
But it gives an error. How can I achieve this?
Please see the example below:
>>> from pyspark.sql.functions import col
>>> df = (sc.textFile('data.txt')
...       .map(lambda line: line.split(","))
...       .toDF(['name', 'age', 'height'])
...       .select(col('name'), col('age').cast('int'), col('height').cast('int')))
>>> df.show()
+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice|  5|    80|
|  Bob|  5|    80|
|Alice| 10|    80|
+-----+---+------+
>>> list_persons = [row.asDict() for row in df.collect()]
>>> list_persons
[
{'age': 5, 'name': u'Alice', 'height': 80},
{'age': 5, 'name': u'Bob', 'height': 80},
{'age': 10, 'name': u'Alice', 'height': 80}
]
>>> dict_persons = {person['name']: person for person in list_persons}
>>> dict_persons
{u'Bob': {'age': 5, 'name': u'Bob', 'height': 80}, u'Alice': {'age': 10, 'name': u'Alice', 'height': 80}}
The input file that I'm using for testing, data.txt:
Alice,5,80
Bob,5,80
Alice,10,80
First we load the data with PySpark by reading the lines. Then we turn each line into columns by splitting on the comma. Then we convert the RDD to a DataFrame and name the columns. Finally we cast the columns to the appropriate types.
Then we collect everything to the driver, and using a Python comprehension we convert the data to the preferred form. We convert each Row object to a dictionary using the asDict() method. In the output we can observe that Alice appears only once; this is because the value stored under the key Alice gets overwritten by the later row.
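If every row for a repeated name should survive, one option is to map each name to a list of rows instead of a single row. The snippet below is only a sketch in plain Python: the hard-coded list_persons stands in for the collected rows, and group_by_name is a hypothetical helper name, not part of PySpark.

```python
from collections import defaultdict

# Stand-in for the rows collected to the driver (same shape as list_persons).
list_persons = [
    {'age': 5, 'name': 'Alice', 'height': 80},
    {'age': 5, 'name': 'Bob', 'height': 80},
    {'age': 10, 'name': 'Alice', 'height': 80},
]


def group_by_name(persons):
    """Map each name to the list of all rows that carry it (hypothetical helper)."""
    grouped = defaultdict(list)
    for person in persons:
        # Append instead of assign, so a repeated key keeps every row.
        grouped[person['name']].append(person)
    return dict(grouped)


persons_by_name = group_by_name(list_persons)
```

With this shape, persons_by_name['Alice'] holds both Alice rows in their original order, and persons_by_name['Bob'] holds a single-element list.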
Please keep in mind that you want to do all the processing and filtering inside PySpark before returning the result to the driver, since collect() pulls every row into the driver's memory.
Hope this helps, cheers.
You need to first convert to a pandas.DataFrame using toPandas(), then you can use the to_dict() method on the transposed dataframe with orient='list':
df.toPandas().set_index('name').T.to_dict('list')
# Out[1]: {u'Alice': [10, 80]}
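Only one entry comes back because all three rows in the question share the name Alice, so the transposed frame has duplicate column labels and a later row replaces the earlier ones. As a rough plain-Python analogue of that result shape (not the pandas implementation), assuming the rows are already available as dictionaries; to_name_lists is a hypothetical helper:

```python
# Stand-in for the question's three rows, already converted to dictionaries.
list_persons = [
    {'age': 5, 'name': 'Alice', 'height': 80},
    {'age': 5, 'name': 'Alice', 'height': 80},
    {'age': 10, 'name': 'Alice', 'height': 80},
]


def to_name_lists(persons, key='name', columns=('age', 'height')):
    """Map each key value to a list of the remaining column values.

    Hypothetical helper: rows that repeat a key overwrite earlier ones,
    so the last row for each name is the one that is kept.
    """
    return {p[key]: [p[c] for c in columns] for p in persons}


result = to_name_lists(list_persons)
```

Here result has a single key, 'Alice', whose value comes from the last row.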