 

How to convert list of dictionaries into Pyspark DataFrame


I want to convert my list of dictionaries into DataFrame. This is the list:

mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"}
]

This is my code:

from pyspark.sql.types import StringType

df = spark.createDataFrame(mylist, StringType())
df.show(2, False)

+-----------------------------------------+
|                                    value|
+-----------------------------------------+
|{type_activity_id=1,type_activity_id=xxx}|
|{type_activity_id=2,type_activity_id=yyy}|
|{type_activity_id=3,type_activity_id=zzz}|
+-----------------------------------------+

I assume that I should provide some mapping and types for each column, but I don't know how to do it.

Update:

I also tried this:

from pyspark.sql.functions import from_json
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType, StringType

schema = ArrayType(
    StructType([StructField("type_activity_id", IntegerType()),
                StructField("type_activity_name", StringType())
                ]))
df = spark.createDataFrame(mylist, StringType())
df = df.withColumn("value", from_json(df.value, schema))

But then I get null values:

+-----+
|value|
+-----+
| null|
| null|
+-----+
Asked Sep 08 '18 by Markus




2 Answers

In the past, you were able to simply pass a dictionary to spark.createDataFrame(), but this is now deprecated:

mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"}
]
df = spark.createDataFrame(mylist)
#UserWarning: inferring schema from dict is deprecated, please use pyspark.sql.Row instead
#  warnings.warn("inferring schema from dict is deprecated,"

As this warning message says, you should use pyspark.sql.Row instead.

from pyspark.sql import Row

spark.createDataFrame(Row(**x) for x in mylist).show(truncate=False)
#+----------------+------------------+
#|type_activity_id|type_activity_name|
#+----------------+------------------+
#|1               |xxx               |
#|2               |yyy               |
#|3               |zzz               |
#+----------------+------------------+

Here I used ** (keyword argument unpacking) to pass the dictionaries to the Row constructor.
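If you prefer not to rely on schema inference at all, you can also supply the column types explicitly. A minimal sketch, assuming the same mylist and an active SparkSession named spark:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Explicit schema matching the keys in mylist
schema = StructType([
    StructField("type_activity_id", IntegerType(), True),
    StructField("type_activity_name", StringType(), True)
])

# Build tuples in the same field order as the schema, then apply it
rows = [(d["type_activity_id"], d["type_activity_name"]) for d in mylist]
df = spark.createDataFrame(rows, schema)
df.printSchema()
df.show(truncate=False)

This avoids the deprecation warning entirely and guarantees that type_activity_id comes out as an integer column rather than whatever inference decides.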

Answered Sep 22 '22 by pault


You can also do it like this; you will get a DataFrame with two columns.

mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"}
]

myJson = sc.parallelize(mylist)
myDf = sqlContext.read.json(myJson)

Output:

+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
|               1|               xxx|
|               2|               yyy|
|               3|               zzz|
+----------------+------------------+
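Note that read.json here appears to work because Spark stringifies the Python dicts and its JSON reader tolerates single quotes by default. If you would rather hand it proper JSON strings, a minimal sketch, assuming an active SparkSession named spark:

import json

# Serialize each dict to a real JSON string before handing it to the reader
jsonRdd = spark.sparkContext.parallelize([json.dumps(d) for d in mylist])
myDf = spark.read.json(jsonRdd)
myDf.show(truncate=False)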
Answered Sep 25 '22 by pissall