I want to convert my list of dictionaries into a DataFrame. This is the list:
mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"}
]
This is my code:
from pyspark.sql.types import StringType

df = spark.createDataFrame(mylist, StringType())
df.show(3, False)

+--------------------------------------------+
|value                                       |
+--------------------------------------------+
|{type_activity_id=1, type_activity_name=xxx}|
|{type_activity_id=2, type_activity_name=yyy}|
|{type_activity_id=3, type_activity_name=zzz}|
+--------------------------------------------+
I assume that I should provide some mapping and types for each column, but I don't know how to do it.
Update:
I also tried this:
from pyspark.sql.functions import from_json
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType, StringType

schema = ArrayType(StructType([
    StructField("type_activity_id", IntegerType()),
    StructField("type_activity_name", StringType())
]))
df = spark.createDataFrame(mylist, StringType())
df = df.withColumn("value", from_json(df.value, schema))
But then I get null values:

+-----+
|value|
+-----+
| null|
| null|
+-----+
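The nulls come from two mismatches: the value column holds the dicts' JVM-style toString output (unquoted keys, `=` instead of `:`), which is not valid JSON, so from_json cannot parse it; and each row is a single object, so the schema would need to be a StructType rather than an ArrayType anyway. A minimal pure-Python sketch of the first problem (no Spark required):

```python
import json

# What the `value` column actually contains: JVM-style map toString,
# not JSON (no quotes, `=` instead of `:`), so from_json yields null.
not_json = "{type_activity_id=1, type_activity_name=xxx}"
try:
    json.loads(not_json)
    parse_failed = False
except json.JSONDecodeError:
    parse_failed = True  # this branch is taken

# A genuinely JSON-encoded record parses fine.
as_json = json.dumps({"type_activity_id": 1, "type_activity_name": "xxx"})
record = json.loads(as_json)
```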
In the past, you were able to simply pass a dictionary to spark.createDataFrame(), but this is now deprecated:
mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"}
]
df = spark.createDataFrame(mylist)
# UserWarning: inferring schema from dict is deprecated, please use pyspark.sql.Row instead
#   warnings.warn("inferring schema from dict is deprecated,"
As this warning message says, you should use pyspark.sql.Row instead.
from pyspark.sql import Row

spark.createDataFrame(Row(**x) for x in mylist).show(truncate=False)
#+----------------+------------------+
#|type_activity_id|type_activity_name|
#+----------------+------------------+
#|1               |xxx               |
#|2               |yyy               |
#|3               |zzz               |
#+----------------+------------------+
Here I used ** (keyword argument unpacking) to pass the dictionaries to the Row constructor.
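The same unpacking pattern works with any constructor that accepts keyword arguments. Here is a self-contained sketch using collections.namedtuple as a stand-in for Row (the Activity type is hypothetical, and no Spark session is needed), just to illustrate what `**x` does:

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row with the same two fields.
Activity = namedtuple("Activity", ["type_activity_id", "type_activity_name"])

mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"},
]

# `Activity(**x)` unpacks each dict into keyword arguments,
# exactly as Row(**x) does above.
rows = [Activity(**x) for x in mylist]
```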
You can also do it by reading the list as JSON. You will get a DataFrame with two columns.
mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"}
]
myJson = sc.parallelize(mylist)
myDf = sqlContext.read.json(myJson)
Output:

+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
|               1|               xxx|
|               2|               yyy|
|               3|               zzz|
+----------------+------------------+