
Create Spark DataFrame from nested dictionary

I have a list of nested dictionaries, e.g. ds = [{'a': {'b': {'c': 1}}}], and want to create a Spark DataFrame from it while inferring the schema of the nested dictionaries. Calling sqlContext.createDataFrame(ds).printSchema() gives me the following schema:

root
 |-- a: map (nullable = true)
 |    |-- key: string
 |    |-- value: map (valueContainsNull = true)
 |    |    |-- key: string
 |    |    |-- value: long (valueContainsNull = true)

but what I need is this:

root
 |-- a: struct (nullable = true)
 |    |-- b: struct (nullable = true)
 |    |    |-- c: long (nullable = true)

The second schema can be produced by first converting the dictionaries to JSON and then loading them with jsonRDD, like this: sqlContext.jsonRDD(sc.parallelize([json.dumps(ds[0])])).printSchema(). But this would be quite cumbersome for large files.
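For readers on Spark 2.0 or later, where jsonRDD is, as far as I know, no longer available, the same round trip can be sketched with spark.read.json, which accepts an RDD of JSON strings. The helper names to_json_lines and nested_frame are made up for this illustration, and spark is assumed to be an existing SparkSession:

```python
import json


def to_json_lines(records):
    """Serialize each dictionary to one JSON string (one record per string)."""
    return [json.dumps(record) for record in records]


def nested_frame(spark, records):
    """Build a DataFrame whose nested dictionaries are inferred as structs.

    Assumes `spark` is an existing SparkSession (hypothetical parameter);
    the JSON reader infers struct types for nested objects.
    """
    rdd = spark.sparkContext.parallelize(to_json_lines(records))
    return spark.read.json(rdd)


ds = [{'a': {'b': {'c': 1}}}]
lines = to_json_lines(ds)
# With a running session: nested_frame(spark, ds).printSchema()
```

The serialization step is plain Python, so it can be checked without a cluster; only nested_frame touches Spark.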

I thought about converting the dictionaries to pyspark.sql.Row() objects, hoping that the DataFrame would infer the schema, but that didn't work when the dictionaries had different schemas (e.g. the first one was missing some key).

Is there any other way to do this? Thanks!

Marigold asked Apr 21 '15 11:04


1 Answer

I think this will help.

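import json

# Assumes the sc and sqlContext objects from the question (e.g. the pyspark shell).
ds = [{'a': {'b': {'c': 1}}}]
# Serialize each dictionary to a JSON string so Spark infers structs, not maps.
ds2 = [json.dumps(item) for item in ds]
df = sqlContext.jsonRDD(sc.parallelize(ds2))
df.printSchema()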

This prints:

root
 |-- a: struct (nullable = true)
 |    |-- b: struct (nullable = true)
 |    |    |-- c: long (nullable = true)
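If the JSON round trip is too heavy, another option might be to build the schema yourself and pass it to createDataFrame. Below is a minimal sketch, not a definitive implementation: deep_merge and type_string are hypothetical helpers written for this answer, the type mapping covers only a few scalar types, and lists, empty dictionaries and field names with special characters are not handled. It merges the keys of all dictionaries first, so records that miss some key should still fit the schema.

```python
from functools import reduce

# Assumed mapping from Python scalar types to Spark SQL type names.
_SCALARS = {bool: "boolean", int: "bigint", float: "double", str: "string"}


def deep_merge(a, b):
    """Recursively merge dict b into a copy of dict a, keeping keys of both."""
    out = dict(a)
    for key, value in b.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        elif key not in out or out[key] is None:
            out[key] = value
    return out


def type_string(value):
    """Render a merged template value as a Spark data type string."""
    if isinstance(value, dict):
        inner = ",".join(
            f"{key}:{type_string(val)}" for key, val in sorted(value.items())
        )
        return f"struct<{inner}>"
    # Assumption: anything unrecognized (including None) falls back to string.
    return _SCALARS.get(type(value), "string")


ds = [
    {'a': {'b': {'c': 1}}},
    {'a': {'b': {'d': 'x'}}, 'e': 2.5},  # different keys than the first record
]
template = reduce(deep_merge, ds, {})
schema = type_string(template)
# With a running session (assumed SparkSession named spark):
# df = spark.createDataFrame(ds, schema=schema)
```

To my understanding, createDataFrame accepts such a data type string as its schema argument in recent versions and fills missing struct fields with null, but please verify that against the Spark version you run.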
hyim answered Nov 29 '22 14:11