I have a DataFrame (df) in PySpark, created by reading from a Hive table:
df=spark.sql('select * from <table_name>')
+---------+---------------------------+
| Name    | URL visited               |
+---------+---------------------------+
| person1 | [google,msn,yahoo]        |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com]       |
+---------+---------------------------+
When I tried the following, I got an error:
df_dict = dict(zip(df['name'],df['url']))
"TypeError: zip argument #1 must support iteration."
type(df.name) is pyspark.sql.column.Column
How do I create a dictionary like the following, which can be iterated over later?
{'person1': ['google', 'msn', 'yahoo']}
{'person2': ['fb.com', 'airbnb', 'wired.com']}
{'person3': ['fb.com', 'google.com']}
Appreciate your thoughts and help.
Method 1: Using df.toPandas(). Convert the PySpark DataFrame to a pandas DataFrame with df.toPandas(). Return type: a pandas DataFrame with the same contents as the PySpark DataFrame. Then go through each column and add its list of values to a dictionary, with the column name as the key.
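A minimal sketch of what Method 1 describes, plus the asker's original zip, which works once the data is in pandas (the column names 'Name' and 'URL visited' are assumed from the table above):

pandas_df = df.toPandas()  # brings the full DataFrame to the driver

# As described: one dictionary entry per column, keyed by column name
col_dict = {col: pandas_df[col].tolist() for col in pandas_df.columns}
# {'Name': ['person1', 'person2', 'person3'],
#  'URL visited': [['google', 'msn', 'yahoo'], ...]}

# The originally attempted zip also works here, since pandas Series are iterable
name_to_urls = dict(zip(pandas_df['Name'], pandas_df['URL visited']))
# {'person1': ['google', 'msn', 'yahoo'], ...}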
If a row contains duplicate field names, e.g. the rows of a join between two DataFrames that both have fields of the same name, asDict will keep only one of the duplicate fields. __getitem__ will also return one of the duplicate fields, but the value it returns may differ from the one asDict keeps.
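A quick sketch of that caveat, using a hand-built Row with a duplicated field name to stand in for a join result (which duplicate each call picks is an implementation detail, so treat this as illustrative):

from pyspark.sql import Row

DupRow = Row("id", "id")   # duplicate field names, as after a join on 'id'
row = DupRow(1, 2)

row.asDict()   # {'id': 2}  -- one of the duplicates wins in the dict
row["id"]      # 1          -- __getitem__ may pick the other duplicate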
collect() is the operation on an RDD or DataFrame that retrieves its data. It is useful for retrieving all the row elements from each partition and bringing them back to the driver node/program.
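That means collect() plus a dict comprehension is already enough to build the mapping the question asks for; a sketch, assuming the column names 'Name' and 'URL visited' from the table above:

# collect() returns a list of Row objects on the driver
name_to_urls = {row['Name']: row['URL visited'] for row in df.collect()}
# {'person1': ['google', 'msn', 'yahoo'],
#  'person2': ['fb.com', 'airbnb', 'wired.com'],
#  'person3': ['fb.com', 'google.com']}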
How about using the PySpark Row.asDict() method? This is part of the DataFrame API (which I understand is the "recommended" API at the time of writing) and would not require you to use the RDD API at all.
df_list_of_dict = [row.asDict() for row in df.collect()]
type(df_list_of_dict), type(df_list_of_dict[0])
#(<class 'list'>, <class 'dict'>)
df_list_of_dict
#[{'Name': 'person1', 'URL visited': ['google', 'msn', 'yahoo']},
# {'Name': 'person2', 'URL visited': ['fb.com', 'airbnb', 'wired.com']},
# {'Name': 'person3', 'URL visited': ['fb.com', 'google.com']}]
I think you can try row.asDict(). This code runs directly on the executors, so you don't have to collect the data on the driver.
Something like:
df.rdd.map(lambda row: row.asDict())
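The result is an RDD of plain dicts that you can keep processing on the executors, bringing only an aggregate back to the driver; a sketch using the 'URL visited' column name from the question:

dict_rdd = df.rdd.map(lambda row: row.asDict())

# e.g. count how many people visited google, without collecting the rows
n_google = dict_rdd.filter(lambda d: 'google' in d['URL visited']).count()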