
Retain keys with null values while writing JSON in Spark

I am trying to write a JSON file using Spark. Some keys have null as their value. They show up just fine in the Dataset, but when I write the file, those keys get dropped. How do I ensure they are retained?

Code to write the file:

ddp.coalesce(20).write().mode("overwrite").json("hdfs://localhost:9000/user/dedupe_employee");

Part of the JSON data from the source:

"event_header": {
        "accept_language": null,
        "app_id": "App_ID",
        "app_name": null,
        "client_ip_address": "IP",
        "event_id": "ID",
        "event_timestamp": null,
        "offering_id": "Offering",
        "server_ip_address": "IP",
        "server_timestamp": 1492565987565,
        "topic_name": "Topic",
        "version": "1.0"
    }

Output:

"event_header": {
        "app_id": "App_ID",
        "client_ip_address": "IP",
        "event_id": "ID",
        "offering_id": "Offering",
        "server_ip_address": "IP",
        "server_timestamp": 1492565987565,
        "topic_name": "Topic",
        "version": "1.0"
    }

In the above example, the keys accept_language, app_name and event_timestamp have been dropped.

Vaishak Suresh asked May 30 '17 20:05



1 Answer

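If you are on Spark 3.0 or later, you can set the following configuration property to false (it defaults to true), so that null fields are kept when JSON is generated: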

spark.sql.jsonGenerator.ignoreNullFields false
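
As a minimal sketch of how this might be applied from the Java API used in the question (assuming Spark 3.0 or later on the classpath): the property can be set when the session is built, and the JSON writer also accepts a per-write ignoreNullFields option. The class name, the local master, the tiny example Dataset, and the output path below are placeholders for illustration, not part of the original question.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RetainNullKeys {
    public static void main(String[] args) {
        // Session-wide setting: keep null fields whenever JSON is generated.
        SparkSession spark = SparkSession.builder()
                .appName("RetainNullKeys")   // hypothetical app name
                .master("local[*]")          // assumption: local run for illustration
                .config("spark.sql.jsonGenerator.ignoreNullFields", "false")
                .getOrCreate();

        // Hypothetical stand-in for the question's `ddp` Dataset:
        // one row with a populated column and a null column.
        Dataset<Row> ddp = spark.sql(
                "SELECT 'App_ID' AS app_id, CAST(NULL AS STRING) AS app_name");

        // Per-write alternative: the JSON data source option overrides
        // the session-wide default for this write only.
        ddp.write()
           .mode("overwrite")
           .option("ignoreNullFields", "false")
           .json("/tmp/dedupe_employee_example"); // placeholder output path

        spark.stop();
    }
}
```

Either setting on its own should be enough; both are shown here only to illustrate the two places the behaviour can be controlled. With null fields retained, the written records should keep keys such as app_name with a null value instead of omitting them.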
mani_nz answered Sep 28 '22 08:09