Read and write empty string "" vs NULL in Spark 2.0.1

CSVFileFormat seems to read and write empty values as null for string columns. I have searched around but have been unable to find clear information about this, so I put together a simple test.

val df = session.createDataFrame(Seq(
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, ""),
    (4, null)
))  

df.coalesce(1).write.mode("overwrite").format("csv") 
    .option("delimiter", ",")
    .option("nullValue", "unknown")
    .option("treatEmptyValuesAsNulls", "false")
    .save(s"$path/test")

This outputs:

0,a
1,b
2,c
3,unknown
4,unknown

So, it appears to be treating both empty strings and null values as null. The same thing happens when reading a CSV file with empty quoted strings and nulls. Is there currently any way to treat these differently?
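On Spark versions before the fix described in the answer below, one common workaround is to encode the distinction yourself before writing, so the CSV writer never has to tell empty strings and nulls apart. A minimal sketch (assuming the default tuple column name `_2` and the same hypothetical `path` variable as above):

```scala
import org.apache.spark.sql.functions._

// Map nulls to a sentinel explicitly, leaving genuine "" values alone,
// so the distinction survives the round trip through CSV.
val marked = df.withColumn("_2",
  when(col("_2").isNull, lit("unknown")).otherwise(col("_2")))

marked.coalesce(1).write.mode("overwrite")
  .option("delimiter", ",")
  .csv(s"$path/test")
```

On the read side you would then translate the sentinel back to null yourself, e.g. with a matching `when(col("_2") === "unknown", lit(null))`.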

Asked Dec 09 '16 by Kyro

People also ask

Which is better empty string or null?

So it is better to use an empty string for the database, since allowing NULL values forces the system to do extra work and does not give you the data you are looking for. However, NULL does not throw an exception when the count() function is executed.

Is string empty and null the same?

The Java programming language distinguishes between null and empty strings. An empty string is a string instance of zero length, whereas a null string has no value at all. An empty string is represented as "" . It is a character sequence of zero characters.
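The same distinction holds on the JVM generally, so it can be illustrated in plain Scala:

```scala
// A null reference has no value at all; "" is a real String of length zero.
val empty: String = ""
val nothing: String = null

println(empty == null)    // false: "" is a real object
println(empty.isEmpty)    // true: it just has zero characters
println(nothing == null)  // true: no object behind the reference
// nothing.isEmpty would throw a NullPointerException
```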

Is empty string null in PySpark?

In Spark 2.0.0, blank values were read into DataFrames as empty strings and empty strings were read into DataFrames as null.

How do you deal with nulls in Spark?

You can keep null values out of certain columns by setting nullable to false . You won't be able to set nullable to false for all columns in a DataFrame and pretend like null values don't exist. For example, when joining DataFrames, the join column will return null when a match cannot be made.
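For cases where nulls do end up in a DataFrame, a couple of standard moves are shown in this sketch (using the question's `df` and its default tuple column `_2`):

```scala
import org.apache.spark.sql.functions._

// Replace nulls in string columns with a default value:
val filled = df.na.fill("unknown")

// Or drop rows where the column is null entirely:
val nonNull = df.filter(col("_2").isNotNull)
```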


1 Answer

A mere two and a half years later, empty strings are no longer considered equal to null values, thanks to Spark 2.4.0! See this commit for details on the new behavior. Your code will behave as expected under 2.4.0+:

val df = session.createDataFrame(Seq(
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, ""),
    (4, null)
))  

df.coalesce(1).write.mode("overwrite").format("csv") 
    .option("delimiter", ",")
    .option("nullValue", "unknown")
    .option("treatEmptyValuesAsNulls", "false")
    .save(s"$path/test")

Results in:

0,a
1,b
2,c
3,
4,unknown
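Reading the file back can preserve the same distinction. A sketch, assuming Spark 2.4+ (where, to my understanding, the `emptyValue` CSV option was introduced alongside this fix):

```scala
// nullValue maps the sentinel back to null on read; emptyValue controls
// what an empty field becomes, so "" and null round-trip as distinct values.
val readBack = session.read
  .option("delimiter", ",")
  .option("nullValue", "unknown")
  .option("emptyValue", "")
  .csv(s"$path/test")
```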
Answered Sep 29 '22 by bsplosion