 

How to convert rows into a list of dictionaries in pyspark?

I have a DataFrame (df) in PySpark, created by reading from a Hive table:

df=spark.sql('select * from <table_name>')


+---------+---------------------------+
| Name    | URL visited               |
+---------+---------------------------+
| person1 | [google,msn,yahoo]        |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com]       |
+---------+---------------------------+

When I tried the following, I got an error:

df_dict = dict(zip(df['name'], df['url']))
# TypeError: zip argument #1 must support iteration

type(df.name) is pyspark.sql.column.Column, which does not support iteration on the driver.

How do I create a dictionary like the following, which can be iterated over later on?

{'person1': ['google', 'msn', 'yahoo']}
{'person2': ['fb.com', 'airbnb', 'wired.com']}
{'person3': ['fb.com', 'google.com']}

Appreciate your thoughts and help.

asked Mar 22 '18 by user8946942

People also ask

How do you convert a PySpark DataFrame to a list of dictionaries?

Method 1: using df.toPandas(). Convert the PySpark DataFrame to a pandas DataFrame with df.toPandas(); this returns a pandas DataFrame with the same content as the PySpark DataFrame. Then go through each column and add its list of values to the dictionary, with the column name as the key.
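As a minimal sketch of this route: the pandas DataFrame `pdf` below stands in for the result of `df.toPandas()` (building a SparkSession is out of scope here), and the column names are taken from the question's table. `to_dict(orient='records')` then yields one dict per row, keyed by column name, which is the same shape the answers below produce with `row.asDict()`.

```python
import pandas as pd

# Stand-in for df.toPandas(); column names come from the question's table.
pdf = pd.DataFrame({
    'Name': ['person1', 'person2', 'person3'],
    'URL visited': [['google', 'msn', 'yahoo'],
                    ['fb.com', 'airbnb', 'wired.com'],
                    ['fb.com', 'google.com']],
})

# One dict per row, keyed by column name:
records = pdf.to_dict(orient='records')
print(records[0])  # {'Name': 'person1', 'URL visited': ['google', 'msn', 'yahoo']}
```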

What is asDict in PySpark?

If a row contains duplicate field names, e.g. the rows of a join between two DataFrames that both have fields of the same name, one of the duplicate fields will be selected by asDict. __getitem__ will also return one of the duplicate fields, but the value it returns might differ from the one asDict selects.

What does .collect do in PySpark?

collect() is an action on an RDD or DataFrame that retrieves its data: it gathers all the rows from every partition and brings them to the driver program.


2 Answers

How about using the PySpark Row.asDict() method? This is part of the DataFrame API (which I understand is the recommended API at the time of writing) and does not require you to use the RDD API at all.

df_list_of_dict = [row.asDict() for row in df.collect()]

type(df_list_of_dict), type(df_list_of_dict[0])
#(<class 'list'>, <class 'dict'>)

df_list_of_dict
#[{'Name': 'person1', 'URL visited': ['google', 'msn', 'yahoo']},
# {'Name': 'person2', 'URL visited': ['fb.com', 'airbnb', 'wired.com']},
# {'Name': 'person3', 'URL visited': ['fb.com', 'google.com']}]
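To get exactly the {name: urls} mapping asked for in the question, a plain dict comprehension over the collected rows is enough. A minimal sketch, using literal dicts to stand in for the `row.asDict()` output (the column names 'Name' and 'URL visited' come from the question's table):

```python
# Stand-in for [row.asDict() for row in df.collect()]:
rows = [
    {'Name': 'person1', 'URL visited': ['google', 'msn', 'yahoo']},
    {'Name': 'person2', 'URL visited': ['fb.com', 'airbnb', 'wired.com']},
    {'Name': 'person3', 'URL visited': ['fb.com', 'google.com']},
]

# Single dictionary mapping each name to its list of URLs:
name_to_urls = {r['Name']: r['URL visited'] for r in rows}
print(name_to_urls['person1'])  # ['google', 'msn', 'yahoo']
```

The resulting dict can then be iterated with the usual `for name, urls in name_to_urls.items():` loop.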
answered Sep 22 '22 by user9074332


I think you can try row.asDict(); this code runs directly on the executors, so you don't have to collect the data on the driver.

Something like:

df.rdd.map(lambda row: row.asDict())
# map() is lazy; follow it with an action such as .collect() to materialize the result
answered Sep 23 '22 by Cosmin