Retain raw JSON as column in Spark DataFrame on read/load?

I have been looking for a way to add my raw (JSON) data as a column when reading my data into a Spark DataFrame. I have one way to do this with a join but am hoping there is a way to do this in a single operation using Spark 2.2.x+.

Given this example data:

{"team":"Golden Knights","colors":"gold,red,black","origin":"Las Vegas"}
{"team":"Sharks","origin": "San Jose", "eliminated":"true"}
{"team":"Wild","colors":"red,green,gold","origin":"Minnesota"}

When executing:

val logs = sc.textFile("/Users/vgk/data/tiny.json") // example data file
spark.read.json(logs).show

Predictably we get:

+--------------+----------+--------------------+--------------+
|        colors|eliminated|              origin|          team|
+--------------+----------+--------------------+--------------+
|gold,red,black|      null|           Las Vegas|Golden Knights|
|          null|      true|            San Jose|        Sharks|
|red,green,gold|      null|           Minnesota|          Wild|
|red,white,blue|     false|District of Columbia|      Capitals|
+--------------+----------+--------------------+--------------+

What I'd like to have on initial load is the above, but with the raw JSON data as an additional column. For example (truncated raw values):

+--------------+----------+--------------------+--------------+--------------------+
|        colors|eliminated|              origin|          team|               value|
+--------------+----------+--------------------+--------------+--------------------+
|red,white,blue|     false|District of Columbia|      Capitals|{"colors":"red,wh...|
|gold,red,black|      null|           Las Vegas|Golden Knights|{"colors":"gold,r...|
|          null|      true|            San Jose|        Sharks|{"eliminated":"tr...|
|red,green,gold|      null|           Minnesota|          Wild|{"colors":"red,gr...|
+--------------+----------+--------------------+--------------+--------------------+

A non-ideal solution involves a join:

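import org.apache.spark.sql.functions.monotonically_increasing_id

val logs = sc.textFile("/Users/vgk/data/tiny.json")
// Tag each parsed row with an ID, re-serialize the rows to JSON with matching IDs, then join
val df = spark.read.json(logs).withColumn("uniqueID", monotonically_increasing_id())
val rawdf = df.toJSON.withColumn("uniqueID", monotonically_increasing_id())
df.join(rawdf, "uniqueID")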

This results in the same DataFrame as above, but with an added uniqueID column. Additionally, the JSON is rendered from the DataFrame and is not necessarily the "raw" data. In practice the two are equivalent, but for my use case the actual raw data is preferable.
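To make the "rendered, not raw" point concrete, here is a minimal sketch. It assumes a local SparkSession and uses a one-line in-memory Dataset as a stand-in for the file (both are illustrative choices, not part of the original setup). It parses one raw line and re-serializes it with toJSON so the two strings can be compared side by side:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("tojson-vs-raw").getOrCreate()
import spark.implicits._

// One raw line from the example data (note the spaces after some separators)
val raw = Seq("""{"team":"Sharks","origin": "San Jose", "eliminated":"true"}""").toDS()

// Parse, then re-serialize: this text is regenerated from the parsed row,
// so the original whitespace and key order are not preserved
val rendered = spark.read.json(raw).toJSON

raw.show(false)
rendered.show(false)
```

The two strings describe the same record, but they are not byte-for-byte identical, which is why a column built from toJSON is only an approximation of the raw input.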

Is anyone aware of a solution that will capture the raw JSON data as an additional column on load?

asked May 07 '18 15:05

reverend

1 Answer

If you know the schema of the incoming data, you can use from_json with that schema to extract all the fields while keeping the raw string as its own column:

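import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._ // enables toDF on RDDs and the $"col" syntax

val logs = spark.sparkContext.textFile(path) // example data file

// Explicit schema: every field is read as a nullable string
val schema = StructType(
  StructField("team", StringType, true) ::
  StructField("colors", StringType, true) ::
  StructField("eliminated", StringType, true) ::
  StructField("origin", StringType, true) :: Nil
)

logs.toDF("values")
    .withColumn("json", from_json($"values", schema)) // parse each raw line into a struct
    .select("values", "json.*")                       // keep the raw line, flatten the struct
    .show(false)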

Output:

+------------------------------------------------------------------------+--------------+--------------+----------+---------+
|values                                                                  |team          |colors        |eliminated|origin   |
+------------------------------------------------------------------------+--------------+--------------+----------+---------+
|{"team":"Golden Knights","colors":"gold,red,black","origin":"Las Vegas"}|Golden Knights|gold,red,black|null      |Las Vegas|
|{"team":"Sharks","origin": "San Jose", "eliminated":"true"}             |Sharks        |null          |true      |San Jose |
|{"team":"Wild","colors":"red,green,gold","origin":"Minnesota"}          |Wild          |red,green,gold|null      |Minnesota|
+------------------------------------------------------------------------+--------------+--------------+----------+---------+
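If the schema is not known up front, one possible variation is to let Spark infer it first and then reuse it with from_json. The sketch below is not part of the approach above: it assumes Spark 2.2+ (for `spark.read.json` on a `Dataset[String]`), a local SparkSession, and an in-memory Dataset standing in for the file. The names `raw`, `inferred`, and `withRaw` are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json

// Assumption: a local session for illustration; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("raw-json-column").getOrCreate()
import spark.implicits._

// Hypothetical in-memory stand-in for the lines of the input file
val raw = Seq(
  """{"team":"Golden Knights","colors":"gold,red,black","origin":"Las Vegas"}""",
  """{"team":"Sharks","origin": "San Jose", "eliminated":"true"}""",
  """{"team":"Wild","colors":"red,green,gold","origin":"Minnesota"}"""
).toDS() // Dataset[String] with a single column named "value"

// First pass: infer the schema from the raw strings
val inferred = spark.read.json(raw).schema

// Second pass: parse each raw string with that schema, keeping the original text
val withRaw = raw.toDF("value")
  .withColumn("json", from_json($"value", inferred))
  .select("json.*", "value")

withRaw.show(false)
```

For a real file, `spark.read.textFile(path)` should give the same kind of `Dataset[String]`. The trade-off is that the data is scanned twice, once for inference and once for parsing, so an explicit schema is likely the better choice when one is available.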

Hope this helps!

answered Nov 11 '22 18:11

koiralo

Tags: json, apache-spark, apache-spark-sql
