Create Spark DataFrame from nested dictionary

Question

I have a list of nested dictionaries, e.g. ds = [{'a': {'b': {'c': 1}}}], and want to create a Spark DataFrame from it while inferring the schema of the nested dictionaries. Calling sqlContext.createDataFrame(ds).printSchema() gives me the following schema:

root
 |-- a: map (nullable = true)
 |    |-- key: string
 |    |-- value: map (valueContainsNull = true)
 |    |    |-- key: string
 |    |    |-- value: long (valueContainsNull = true)

But what I need is this:

root
 |-- a: struct (nullable = true)
 |    |-- b: struct (nullable = true)
 |    |    |-- c: long (nullable = true)

The second schema can be created by first converting the dictionaries to JSON and then loading them with jsonRDD, like this: sqlContext.jsonRDD(sc.parallelize([json.dumps(ds[0])])).printSchema(). But this would be quite cumbersome for large files.

I thought about converting the dictionaries to pyspark.sql.Row() objects, hoping that the DataFrame would infer the schema, but that didn't work when the dictionaries had different schemas (e.g. the first one was missing some key).
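To make the "different schemas" problem concrete, here is a rough, Spark-free sketch that computes the nested union of keys across all records. The helper names merge_shape and union_shape are made up for this illustration, and this is not Spark's inference algorithm; it only shows the shape a struct schema would have to cover when some records are missing keys.

```python
def merge_shape(shape, record):
    """Merge the keys of one (possibly nested) dictionary into `shape`.

    Leaves are recorded as the Python type name of the first value seen.
    If a key is a leaf in one record and a dict in another, the first
    occurrence wins (a simplification chosen for this sketch).
    """
    for key, value in record.items():
        if isinstance(value, dict):
            child = shape.setdefault(key, {})
            if isinstance(child, dict):
                merge_shape(child, value)
        else:
            shape.setdefault(key, type(value).__name__)
    return shape


def union_shape(records):
    """Return the nested union of keys found across all records."""
    shape = {}
    for record in records:
        merge_shape(shape, record)
    return shape


# The first record is missing the keys 'd' and 'e' that the second one has.
ds = [{'a': {'b': {'c': 1}}}, {'a': {'b': {'c': 2, 'd': 'x'}}, 'e': 3.5}]
shape = union_shape(ds)
print(shape)
```

A schema inferred from the first record alone would not know about 'd' or 'e', which is consistent with the failure described above.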

Is there any other way to do this? Thanks!

hyim · Accepted Answer

I think this will help: serialize each dictionary to a JSON string and load the strings with jsonRDD.

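import json

ds = [{'a': {'b': {'c': 1}}}]
# Serialize each dictionary to a JSON string so nested dicts are inferred as structs, not maps
ds2 = [json.dumps(item) for item in ds]
# sqlContext and sc are the SQLContext and SparkContext from the question (e.g. the PySpark shell)
df = sqlContext.jsonRDD(sc.parallelize(ds2))
df.printSchema()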

This prints:

root
 |-- a: struct (nullable = true)
 |    |-- b: struct (nullable = true)
 |    |    |-- c: long (nullable = true)
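For newer Spark versions, here is a hedged sketch of the same idea. As far as I know, SQLContext.jsonRDD was removed in Spark 2.0, and DataFrameReader.json accepts an RDD of JSON strings instead. The function names to_json_strings and nested_dicts_to_df are made up for this example, and `spark` is assumed to be an existing SparkSession that you pass in.

```python
import json


def to_json_strings(records):
    """Serialize each dictionary to one JSON string (one record per string)."""
    return [json.dumps(record) for record in records]


def nested_dicts_to_df(spark, records):
    """Build a DataFrame from nested dictionaries via their JSON form.

    Assumption: `spark` is an existing SparkSession (Spark 2.0 or later).
    spark.read.json is given an RDD of JSON strings here.
    """
    rdd = spark.sparkContext.parallelize(to_json_strings(records))
    return spark.read.json(rdd)


# The second record has an extra key 'd' that the first one lacks.
ds = [{'a': {'b': {'c': 1}}}, {'a': {'b': {'c': 2, 'd': 'x'}}}]
json_strings = to_json_strings(ds)
print(json_strings)
```

With a running session you would call nested_dicts_to_df(spark, ds).printSchema(). JSON schema inference looks at the records rather than only the first one, so dictionaries with missing keys should be less of a problem than with the Row approach.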

Tags:

apache-spark

pyspark

Marigold
