Say I have a list of Python dictionaries, where each key corresponds to a column name of a table. For the list below, how do I convert it into a PySpark DataFrame with two columns, arg1 and arg2?
[{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]
How can I use the following construct to do it?
df = sc.parallelize([
    ...
]).toDF()
Where do arg1 and arg2 go in the above code (the ... part)?
To do this, the spark.createDataFrame() method is used. It takes two arguments, data and columns: data holds the rows and columns holds the list of column names. So first build a list of row values and a list of column names, then pass the zipped row data together with the column names to spark.createDataFrame().
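A minimal sketch of that approach, assuming a SparkSession is available as spark (created here via getOrCreate()):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dicts = [{"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}]

# Pull each column's values out of the dicts, then zip them into row tuples
arg1_values = [d["arg1"] for d in dicts]
arg2_values = [d["arg2"] for d in dicts]
data = list(zip(arg1_values, arg2_values))  # one tuple per row

df = spark.createDataFrame(data, ["arg1", "arg2"])
df.show()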
We can also convert the dictionaries to a pandas DataFrame first, using the pd.DataFrame.from_dict() class method (or, for a list of dicts like this one, the plain pd.DataFrame() constructor), and then hand the result to spark.createDataFrame().
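A sketch of that route, assuming pandas is installed alongside PySpark:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dicts = [{"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}]

# A list of dicts goes straight into the pandas constructor;
# from_dict() is the route when you start from a single dict of column lists
pdf = pd.DataFrame(dicts)

df = spark.createDataFrame(pdf)  # Spark reuses the pandas column names
df.show()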
Old way (still works, but Spark warns that inferring the schema from a dict is deprecated):

df = sc.parallelize([
    {"arg1": "", "arg2": ""},
    {"arg1": "", "arg2": ""},
    {"arg1": "", "arg2": ""}
]).toDF()
New way:
from pyspark.sql import Row
from collections import OrderedDict
def convert_to_row(d: dict) -> Row:
    # Sort the keys so every Row carries its fields in the same order,
    # then unpack them into a Row
    return Row(**OrderedDict(sorted(d.items())))

df = sc.parallelize([{"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}]) \
    .map(convert_to_row) \
    .toDF()
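Either way the result has the two expected columns. A quick check, assuming df holds the result of one of the snippets above:

df.printSchema()
# root
#  |-- arg1: string (nullable = true)
#  |-- arg2: string (nullable = true)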