I have a list of nested dictionaries, e.g. ds = [{'a': {'b': {'c': 1}}}]
and want to create a Spark DataFrame from it while inferring the schema of the nested dictionaries. Using sqlContext.createDataFrame(ds).printSchema()
gives me the following schema:
root
|-- a: map (nullable = true)
| |-- key: string
| |-- value: map (valueContainsNull = true)
| | |-- key: string
| | |-- value: long (valueContainsNull = true)
But what I need is this:
root
|-- a: struct (nullable = true)
| |-- b: struct (nullable = true)
| | |-- c: long (nullable = true)
The second schema can be created by first converting the dictionaries to JSON and then loading them with jsonRDD,
like this: sqlContext.jsonRDD(sc.parallelize([json.dumps(ds[0])])).printSchema().
But this would be quite cumbersome for large files.
I thought about converting the dictionaries to pyspark.sql.Row()
objects, hoping that the DataFrame would infer the schema, but it didn't work when the dictionaries had different schemas (e.g. the first one was missing some key).
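For context, the conversion I tried looked roughly like this (a minimal sketch; to_row is just my helper name for the recursive conversion):
from pyspark.sql import Row
def to_row(d):
    # recursively turn nested dicts into Row objects so they infer as structs
    return Row(**{k: to_row(v) if isinstance(v, dict) else v
                  for k, v in d.items()})
rows = [to_row(d) for d in ds]
# works when all dicts share the same keys, but schema inference
# breaks as soon as the resulting Rows have different fields
df = sqlContext.createDataFrame(rows)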
Is there any other way to do this? Thanks!
I think this will help.
import json
ds = [{'a': {'b': {'c': 1}}}]
# serialize each dict to a JSON string so jsonRDD can infer nested structs
ds2 = [json.dumps(item) for item in ds]
# jsonRDD infers struct (not map) types from the JSON strings
df = sqlContext.jsonRDD(sc.parallelize(ds2))
df.printSchema()
Then you get:
root
|-- a: struct (nullable = true)
| |-- b: struct (nullable = true)
| | |-- c: long (nullable = true)
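A side note: jsonRDD was removed in Spark 2.x. If you are on a newer version, the same idea should work with spark.read.json on an RDD of JSON strings (this is a sketch assuming spark is your SparkSession):
import json
ds = [{'a': {'b': {'c': 1}}}]
json_rdd = sc.parallelize([json.dumps(item) for item in ds])
# read.json also accepts an RDD of JSON strings and infers struct types
df = spark.read.json(json_rdd)
df.printSchema()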