How can I convert a spark dataframe column, containing serialized json, into a dataframe itself?

Reasons I felt this was not a duplicate of this question:

  • from_json requires knowledge of the JSON schema ex-ante, which I do not have.
  • get_json_object - I attempted to use this, but its result is itself a string, leaving me back at square one (see the sketch below). Additionally, it appears from the exprs statement that the author again expects ex-ante knowledge of the schema, rather than inferring it.
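
As a quick illustration of that second point (the JSON path '$.b' is hypothetical; only the 'fields' column name comes from my data), get_json_object hands back a plain string even when the underlying value is nested:

from pyspark.sql import functions as F

# The extracted value arrives as a string, not a struct
df.select(F.get_json_object(F.col('fields'), '$.b')).printSchema()
# root
#  |-- get_json_object(fields, $.b): string (nullable = true)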

Requirements:

  • ex-ante, I do not have knowledge of what the JSON schema is, and thus need to infer it. spark.read.json seems the best candidate for inferring the schema, but all the examples I came across loaded the JSON from files; in my use case, the JSON is contained within a column of a dataframe.

  • I am agnostic to the source file type (in my case, tested with Parquet and CSV). However, the source dataframe schema is and will be well structured. For my use case, the JSON is contained in a source dataframe column called 'fields'.

  • The resulting dataframe should link to the primary key in the source dataframe ('id' for my example).
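
A minimal sketch of that setup (the 'id' and 'fields' column names are from my data; the JSON payloads are made up for illustration):

df = spark.createDataFrame(
    [(1, '{"a": 1, "b": {"c": "x"}}'),
     (2, '{"a": 2, "b": {"c": "y"}}')],
    ['id', 'fields'])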


1 Answer

The key turned out to be in the Spark source code: path, when passed to spark.read.json, may be an "RDD of Strings storing json objects".
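
A quick sketch of that behavior with toy JSON strings (assuming an active SparkSession named spark):

# spark.read.json infers a unioned schema from an RDD of JSON strings
inferred = spark.read.json(spark.sparkContext.parallelize(
    ['{"a": 1}', '{"a": 2, "b": "x"}']))
inferred.printSchema()
# root
#  |-- a: long (nullable = true)
#  |-- b: string (nullable = true)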

Here's the source dataframe schema:
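
(Reconstructed for the illustrative dataframe sketched above; the real column types may differ.)

root
 |-- id: long (nullable = true)
 |-- fields: string (nullable = true)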

The code I came up with was:

import json

def inject_id(row):
    # Parse the serialized JSON held in the 'fields' column, then
    # carry the source primary key into the payload so each parsed
    # record stays linked to its original row.
    js = json.loads(row['fields'])
    js['id'] = row['id']
    return json.dumps(js)

# spark.read.json accepts an RDD of JSON strings and infers the schema
json_df = spark.read.json(df.rdd.map(inject_id))

json_df then had a schema inferred from the JSON itself:
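
(Again reconstructed for the illustrative rows; spark.read.json returns top-level fields in alphabetical order, with the injected id alongside the inferred ones.)

root
 |-- a: long (nullable = true)
 |-- b: struct (nullable = true)
 |    |-- c: string (nullable = true)
 |-- id: long (nullable = true)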

Note that I did not test this with a more deeply nested structure, but I believe it will support anything that spark.read.json supports.
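
Since each parsed row carries the id, the inferred columns can be joined back onto the source dataframe; a sketch (the join itself is not part of the original answer):

# Re-attach the parsed columns to the original rows via the shared key
combined = df.join(json_df, on='id')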
