How can I convert a spark dataframe column, containing serialized json, into a dataframe itself?

Reasons I felt this was not a duplicate of this question:

  • from_json requires knowledge of the JSON schema ex-ante, which I do not have.
  • get_json_object - I attempted to use this, but its result is itself a string, leaving me back at square one (see the sketch below). Additionally, it appears from the exprs statement that the author again expects ex-ante knowledge of the schema, rather than inferring it.
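
As a quick illustration of that second point (the JSON path '$.b' is hypothetical; only the 'fields' column name comes from my data), get_json_object hands back a plain string even when the underlying value is nested:

from pyspark.sql import functions as F

# The extracted value arrives as a string, not a struct
df.select(F.get_json_object(F.col('fields'), '$.b')).printSchema()
# root
#  |-- get_json_object(fields, $.b): string (nullable = true)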

Requirements:

  • ex-ante, I do not have knowledge of what the JSON schema is, and thus need to infer it. spark.read.json seems the best candidate for inferring the schema, but all the examples I came across loaded the JSON from files; in my use case, the JSON is contained within a column of a dataframe.

  • I am agnostic to the source file type (in my case, tested with Parquet and CSV). However, the source dataframe schema is and will be well structured. For my use case, the JSON is contained in a source dataframe column called 'fields'.

  • The resulting dataframe should link to the primary key in the source dataframe ('id' for my example).
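
A minimal sketch of that setup (the 'id' and 'fields' column names are from my data; the JSON payloads are made up for illustration):

df = spark.createDataFrame(
    [(1, '{"a": 1, "b": {"c": "x"}}'),
     (2, '{"a": 2, "b": {"c": "y"}}')],
    ['id', 'fields'])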


1 Answer

The key turned out to be in the Spark source code: path, when passed to spark.read.json, may be an "RDD of Strings storing json objects".
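
A quick sketch of that behavior with toy JSON strings (assuming an active SparkSession named spark):

# spark.read.json infers a unioned schema from an RDD of JSON strings
inferred = spark.read.json(spark.sparkContext.parallelize(
    ['{"a": 1}', '{"a": 2, "b": "x"}']))
inferred.printSchema()
# root
#  |-- a: long (nullable = true)
#  |-- b: string (nullable = true)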

Here's the source dataframe schema:
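
(Reconstructed for the illustrative dataframe sketched above; the real column types may differ.)

root
 |-- id: long (nullable = true)
 |-- fields: string (nullable = true)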

The code I came up with was:

import json

def inject_id(row):
    # Parse the serialized JSON held in the 'fields' column, then
    # carry the source primary key into the payload so each parsed
    # record stays linked to its original row.
    js = json.loads(row['fields'])
    js['id'] = row['id']
    return json.dumps(js)

# spark.read.json accepts an RDD of JSON strings and infers the schema
json_df = spark.read.json(df.rdd.map(inject_id))

json_df then had a schema inferred from the JSON itself:
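
(Again reconstructed for the illustrative rows; spark.read.json returns top-level fields in alphabetical order, with the injected id alongside the inferred ones.)

root
 |-- a: long (nullable = true)
 |-- b: struct (nullable = true)
 |    |-- c: string (nullable = true)
 |-- id: long (nullable = true)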

Note that I did not test this with a more deeply nested structure, but I believe it will support anything that spark.read.json supports.
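
Since each parsed row carries the id, the inferred columns can be joined back onto the source dataframe; a sketch (the join itself is not part of the original answer):

# Re-attach the parsed columns to the original rows via the shared key
combined = df.join(json_df, on='id')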
