I have a DataFrame (df) in PySpark, created by reading from a Hive table:
df=spark.sql('select * from <table_name>')
+---------+---------------------------+
| Name    | URL visited               |
+---------+---------------------------+
| person1 | [google,msn,yahoo]        |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com]       |
+---------+---------------------------+
When I tried the following, I got an error:
df_dict = dict(zip(df['name'],df['url']))
"TypeError: zip argument #1 must support iteration."
type(df.name) is pyspark.sql.column.Column
How do I create a dictionary like the following, which can be iterated over later?
{'person1': ['google', 'msn', 'yahoo']}
{'person2': ['fb.com', 'airbnb', 'wired.com']}
{'person3': ['fb.com', 'google.com']}
Appreciate your thoughts and help.
Method 1: Using df.toPandas(). Convert the PySpark DataFrame to a pandas DataFrame with df.toPandas(). Return type: a pandas DataFrame with the same contents as the PySpark DataFrame. Then go through each column and add its list of values to a dictionary, with the column name as the key.
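A minimal sketch of what Method 1 describes, plus the asker's original zip, which works once the data is in pandas (the column names 'Name' and 'URL visited' are assumed from the table above):

pandas_df = df.toPandas()  # brings the full DataFrame to the driver

# As described: one dictionary entry per column, keyed by column name
col_dict = {col: pandas_df[col].tolist() for col in pandas_df.columns}
# {'Name': ['person1', 'person2', 'person3'],
#  'URL visited': [['google', 'msn', 'yahoo'], ...]}

# The originally attempted zip also works here, since pandas Series are iterable
name_to_urls = dict(zip(pandas_df['Name'], pandas_df['URL visited']))
# {'person1': ['google', 'msn', 'yahoo'], ...}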
If a row contains duplicate field names, e.g. the rows of a join between two DataFrames that both have fields of the same name, asDict will keep only one of the duplicate fields. __getitem__ will also return one of the duplicate fields, but the value it returns may differ from the one asDict keeps.
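A quick sketch of that caveat, using a hand-built Row with a duplicated field name to stand in for a join result (which duplicate each call picks is an implementation detail, so treat this as illustrative):

from pyspark.sql import Row

DupRow = Row("id", "id")   # duplicate field names, as after a join on 'id'
row = DupRow(1, 2)

row.asDict()   # {'id': 2}  -- one of the duplicates wins in the dict
row["id"]      # 1          -- __getitem__ may pick the other duplicate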
collect() is the operation on an RDD or DataFrame that retrieves its data. It is useful for retrieving all the row elements from each partition and bringing them back to the driver node/program.
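That means collect() plus a dict comprehension is already enough to build the mapping the question asks for; a sketch, assuming the column names 'Name' and 'URL visited' from the table above:

# collect() returns a list of Row objects on the driver
name_to_urls = {row['Name']: row['URL visited'] for row in df.collect()}
# {'person1': ['google', 'msn', 'yahoo'],
#  'person2': ['fb.com', 'airbnb', 'wired.com'],
#  'person3': ['fb.com', 'google.com']}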
How about using the PySpark Row.asDict() method? This is part of the DataFrame API (which I understand is the "recommended" API at the time of writing) and would not require you to use the RDD API at all.
df_list_of_dict = [row.asDict() for row in df.collect()]
type(df_list_of_dict), type(df_list_of_dict[0])
#(<class 'list'>, <class 'dict'>)
df_list_of_dict
#[{'Name': 'person1', 'URL visited': ['google', 'msn', 'yahoo']},
# {'Name': 'person2', 'URL visited': ['fb.com', 'airbnb', 'wired.com']},
# {'Name': 'person3', 'URL visited': ['fb.com', 'google.com']}]
I think you can try row.asDict(). This code runs directly on the executors, so you don't have to collect the data on the driver.
Something like:
df.rdd.map(lambda row: row.asDict())
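The result is an RDD of plain dicts that you can keep processing on the executors, bringing only an aggregate back to the driver; a sketch using the 'URL visited' column name from the question:

dict_rdd = df.rdd.map(lambda row: row.asDict())

# e.g. count how many people visited google, without collecting the rows
n_google = dict_rdd.filter(lambda d: 'google' in d['URL visited']).count()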