I have a dataframe that I read from a csv.
CSV:
name,age,pets
Alice,23,dog
Bob,30,dog
Charlie,35,
Reading this into a DataFrame called myData:
+-------+---+----+
| name|age|pets|
+-------+---+----+
| Alice| 23| dog|
| Bob| 30| dog|
|Charlie| 35|null|
+-------+---+----+
Now, I want to convert each row of this dataframe to a json using myData.toJSON
. What I get are the following jsons.
{"name":"Alice","age":"23","pets":"dog"}
{"name":"Bob","age":"30","pets":"dog"}
{"name":"Charlie","age":"35"}
I would like the 3rd row's json to include the null value. Ex.
{"name":"Charlie","age":"35", "pets":null}
However, this doesn't seem to be possible. I debugged through the code and saw that Spark's org.apache.spark.sql.catalyst.json.JacksonGenerator
class has the following implementation
private def writeFields(
row: InternalRow, schema: StructType, fieldWriters:
Seq[ValueWriter]): Unit = {
var i = 0
while (i < row.numFields) {
val field = schema(i)
if (!row.isNullAt(i)) {
gen.writeFieldName(field.name)
fieldWriters(i).apply(row, i)
}
i += 1
}
}
This seems to be skipping a column if it is null. I am not quite sure why this is the default behavior but is there a way to print null values in json using Spark's toJSON
?
I am using Spark 2.1.0
In Spark, using filter() or where() functions of DataFrame we can filter rows with NULL values by checking IS NULL or isNULL . These removes all rows with null values on state column and returns the new DataFrame. All above examples returns the same output.
In order to remove Rows with NULL values on selected columns of Spark DataFrame, use drop(columns:Seq[String]) or drop(columns:Array[String]). To these functions pass the names of the columns you wanted to check for NULL values to delete rows.
PySpark fillna() & fill() – Replace NULL/None Values. In PySpark, DataFrame. fillna() or DataFrameNaFunctions. fill() is used to replace NULL/None values on all or selected multiple DataFrame columns with either zero(0), empty string, space, or any constant literal values.
To print the null values in JSON using Spark's toJSON
method, you can use following code:
myData.na.fill("null").toJSON
It will give you expected result:
+-------------------------------------------+
|value |
+-------------------------------------------+
|{"name":"Alice","age":"23","pets":"dog"} |
|{"name":"Bob","age":"30","pets":"dog"} |
|{"name":"Charlie","age":"35","pets":"null"}|
+-------------------------------------------+
I hope it helps!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With