I am trying to write a JSON file using Spark. Some keys have null as their value. These show up fine in the Dataset, but when I write the file the keys get dropped. How do I ensure they are retained?
Code used to write the file:
ddp.coalesce(20).write().mode("overwrite").json("hdfs://localhost:9000/user/dedupe_employee");
Part of the JSON data from the source:
"event_header": {
"accept_language": null,
"app_id": "App_ID",
"app_name": null,
"client_ip_address": "IP",
"event_id": "ID",
"event_timestamp": null,
"offering_id": "Offering",
"server_ip_address": "IP",
"server_timestamp": 1492565987565,
"topic_name": "Topic",
"version": "1.0"
}
Output:
"event_header": {
"app_id": "App_ID",
"client_ip_address": "IP",
"event_id": "ID",
"offering_id": "Offering",
"server_ip_address": "IP",
"server_timestamp": 1492565987565,
"topic_name": "Topic",
"version": "1.0"
}
In the above example the keys accept_language, app_name and event_timestamp have been dropped.
The keys are not being rejected by JSON itself: null is a valid JSON value that can be set on any field, whether it otherwise holds a string, number, boolean, array or object, so the source data above is perfectly legal. The keys disappear because Spark's JSON writer omits fields whose value is null when it generates the output.
(SQL Server behaves the same way with its FOR JSON clause: properties that are null in the query results are left out of the JSON output unless you specify the INCLUDE_NULL_VALUES option.)
Note that this is different from removing rows that contain NULL values in selected columns of a Spark DataFrame: for that you would call drop(columns: Seq[String]) or drop(columns: Array[String]) on df.na, passing the names of the columns to check for NULL values, which deletes the matching rows rather than preserving the null keys (a short sketch follows).
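A minimal sketch of that row-dropping variant, assuming a DataFrame named df with top-level columns matching the field names from the question (the variable and column names are illustrative, not from the original post):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Delete rows where any of the listed columns is NULL.
// This removes data; it does not keep null-valued keys in the JSON output.
Dataset<Row> withoutNullRows =
        df.na().drop(new String[]{"accept_language", "app_name", "event_timestamp"});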
If you are on Spark 3, you can set
spark.sql.jsonGenerator.ignoreNullFields false
in your Spark configuration so that the JSON writer keeps fields whose value is null (the option defaults to true, which is why they are dropped).
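A minimal sketch of two ways to apply this, reusing the ddp Dataset and HDFS path from the question (Spark 3.0+; the app name is a placeholder, and the per-write ignoreNullFields option has the same effect as the session-level config for this output):

import org.apache.spark.sql.SparkSession;

// Session-level: keep null-valued fields for every JSON write in this session.
SparkSession spark = SparkSession.builder()
        .appName("dedupe_employee")
        .config("spark.sql.jsonGenerator.ignoreNullFields", "false")
        .getOrCreate();

// Per-write: override just for this output.
ddp.coalesce(20)
   .write()
   .mode("overwrite")
   .option("ignoreNullFields", "false")
   .json("hdfs://localhost:9000/user/dedupe_employee");

With either setting, accept_language, app_name and event_timestamp are written back as explicit null values instead of being omitted.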