 

How can I print nulls when converting a dataframe to json in Spark

I have a dataframe that I read from a csv.

CSV:
name,age,pets
Alice,23,dog
Bob,30,dog
Charlie,35,

Reading this into a DataFrame called myData:
+-------+---+----+
|   name|age|pets|
+-------+---+----+
|  Alice| 23| dog|
|    Bob| 30| dog|
|Charlie| 35|null|
+-------+---+----+

Now, I want to convert each row of this DataFrame to JSON using myData.toJSON. What I get are the following JSON lines:

{"name":"Alice","age":"23","pets":"dog"}
{"name":"Bob","age":"30","pets":"dog"}
{"name":"Charlie","age":"35"}

I would like the third row's JSON to include the null value, e.g.:

{"name":"Charlie","age":"35", "pets":null}

However, this doesn't seem to be possible. I debugged through the code and saw that Spark's org.apache.spark.sql.catalyst.json.JacksonGenerator class has the following implementation:

  private def writeFields(
      row: InternalRow,
      schema: StructType,
      fieldWriters: Seq[ValueWriter]): Unit = {
    var i = 0
    while (i < row.numFields) {
      val field = schema(i)
      if (!row.isNullAt(i)) {
        gen.writeFieldName(field.name)
        fieldWriters(i).apply(row, i)
      }
      i += 1
    }
  }

This skips a column entirely if it is null. I am not quite sure why this is the default behavior, but is there a way to print null values in JSON using Spark's toJSON?
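To make the two behaviors concrete, here is a minimal plain-Scala sketch (no Spark required) contrasting the skip-null policy above with an emit-null variant. The row representation (a sequence of field name / Option pairs) and both function names are hypothetical stand-ins for what the generator does internally:

```scala
// Hypothetical stand-in for a row: field name -> optional string value.
// A None models a null column, like Charlie's missing "pets" value.
def toJsonSkippingNulls(fields: Seq[(String, Option[String])]): String =
  // Mirrors writeFields: null fields are silently dropped from the output.
  fields.collect { case (k, Some(v)) => s""""$k":"$v"""" }.mkString("{", ",", "}")

def toJsonWithNulls(fields: Seq[(String, Option[String])]): String =
  fields.map {
    case (k, Some(v)) => s""""$k":"$v""""
    case (k, None)    => s""""$k":null""" // write the field with an explicit JSON null
  }.mkString("{", ",", "}")

val charlie = Seq("name" -> Some("Charlie"), "age" -> Some("35"), "pets" -> None)

println(toJsonSkippingNulls(charlie)) // {"name":"Charlie","age":"35"}
println(toJsonWithNulls(charlie))     // {"name":"Charlie","age":"35","pets":null}
```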

I am using Spark 2.1.0.

Rahul asked Aug 11 '17 03:08

People also ask

How do you deal with nulls in Spark?

In Spark, you can filter rows with NULL values using the filter() or where() functions of DataFrame, checking isNull (or IS NULL in SQL). These remove all rows with null values in the checked column and return a new DataFrame; all of the above variants return the same output.

How do I remove null values from a DataFrame Spark?

To remove rows with NULL values in selected columns of a Spark DataFrame, use drop(columns: Seq[String]) or drop(columns: Array[String]), passing the names of the columns you want to check for NULL values.

How do I change NULL values in Spark DataFrame PySpark?

PySpark fillna() & fill() replace NULL/None values. In PySpark, DataFrame.fillna() or DataFrameNaFunctions.fill() is used to replace NULL/None values in all or selected DataFrame columns with zero (0), an empty string, a space, or any constant literal value.


1 Answer

To print the null values in JSON using Spark's toJSON method, you can use the following code:

myData.na.fill("null").toJSON

It will give the following result. Note that na.fill("null") replaces the missing value with the string "null", so the field is serialized as "pets":"null" (a JSON string) rather than a bare JSON null:

+-------------------------------------------+
|value                                      |
+-------------------------------------------+
|{"name":"Alice","age":"23","pets":"dog"}   |
|{"name":"Bob","age":"30","pets":"dog"}     |
|{"name":"Charlie","age":"35","pets":"null"}|
+-------------------------------------------+

I hope it helps!
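If a true JSON null (rather than the string "null") is required, one fragile follow-up option is a textual post-fix on the JSON lines that na.fill("null").toJSON produces. The sketch below uses a plain Seq of strings as a stand-in for the collected toJSON output, so it runs without Spark:

```scala
// Stand-in for myData.na.fill("null").toJSON.collect()
val filled = Seq(
  """{"name":"Alice","age":"23","pets":"dog"}""",
  """{"name":"Charlie","age":"35","pets":"null"}"""
)

// Turn the quoted "null" values into bare JSON nulls. Caveat: this would
// also rewrite a legitimate field whose value is literally the string "null",
// so it is only safe when that string cannot occur in the real data.
val withRealNulls = filled.map(_.replace("\":\"null\"", "\":null"))

withRealNulls.foreach(println)
// {"name":"Alice","age":"23","pets":"dog"}
// {"name":"Charlie","age":"35","pets":null}
```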

himanshuIIITian answered Oct 23 '22 12:10