 

How to handle null values when writing to parquet from Spark


Until recently, Parquet did not support null values - a questionable design decision. In fact, a recent version of the format finally added that support:

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

However, it will likely be a long time before Spark supports that new Parquet feature, if ever. Here is the associated JIRA (closed as Won't Fix):

https://issues.apache.org/jira/browse/SPARK-10943

So what are folks doing today about null column values when writing DataFrames out to Parquet? I can only think of ugly hacks like writing empty strings for text columns, and for numeric values I have no idea how to indicate null, short of putting in some sentinel value and having my code check for it (which is inconvenient and bug-prone).

asked May 03 '18 by WestCoastProjects


People also ask

How does Parquet handle NULL values?

Before the write, the schema's nullability is enforced. But once the DataFrame is written to Parquet and read back, all column nullability flies out the window, as printSchema() on the reloaded DataFrame shows: every column comes back as nullable.
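
A minimal sketch of this round-trip behavior (assuming a Spark shell; the path /tmp/demo.parquet is just an illustrative scratch location):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Build a DataFrame whose schema explicitly forbids nulls in "id".
val schema = StructType(Seq(StructField("id", IntegerType, nullable = false)))
val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row(1))), schema)
df.printSchema()  // |-- id: integer (nullable = false)

// After a write/read round trip, Spark reports the column as nullable.
df.write.mode("overwrite").parquet("/tmp/demo.parquet")
spark.read.parquet("/tmp/demo.parquet").printSchema()
// |-- id: integer (nullable = true)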

How does Spark ignore NULL values?

In order to remove rows with NULL values in selected columns of a Spark DataFrame, use drop(columns: Seq[String]) or drop(columns: Array[String]). Pass these functions the names of the columns you want to check for NULL values; rows containing a NULL in any of them are dropped.
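
A short sketch of that call (the column names here are just illustrative):

// Drop any row where "name" or "age" is NULL; other columns are ignored.
val cleaned = df.na.drop(Seq("name", "age"))
cleaned.show()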

Does Spark join NULL values?

Spark SQL supports a null ordering specification in the ORDER BY clause. Spark processes the ORDER BY clause by placing all the NULL values first or last, depending on the null ordering specification.
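
A small sketch of null ordering (the table and column names are assumptions). By default Spark puts NULLs first for ascending order, so NULLS LAST overrides that:

// SQL form: put NULL ages at the end instead of the default position.
spark.sql("SELECT name, age FROM people ORDER BY age ASC NULLS LAST").show()

// Equivalent DataFrame API call:
df.orderBy(df("age").asc_nulls_last).show()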

How do you handle NULL values in Scala?

In Scala, using null to represent nullable or missing values is an anti-pattern: use the type Option instead. The Option type ensures that you deal with both the presence and the absence of an element. Thanks to the Option type, you can make your system safer by avoiding nasty NullPointerExceptions at runtime.
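
A tiny sketch of that pattern (the lookup function is hypothetical):

// Option makes the "no result" case explicit instead of returning null.
def findUser(id: Int): Option[String] =
  if (id == 1) Some("alice") else None

// The caller is forced to handle both the Some and None cases.
val greeting = findUser(2).map(name => s"Hello, $name").getOrElse("No such user")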


1 Answer

You misinterpreted SPARK-10943. Spark does support writing null values to numeric columns.

The problem is that null alone carries no type information at all:

scala> spark.sql("SELECT null as comments").printSchema root  |-- comments: null (nullable = true) 

As per a comment by Michael Armbrust, all you have to do is cast:

scala> spark.sql("""SELECT CAST(null as DOUBLE) AS comments""").printSchema root |-- comments: double (nullable = true) 

and the result can be safely written to Parquet.
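
Putting it together, a minimal sketch (the output path is just illustrative):

// The cast gives the null column a concrete type, so Parquet can store it.
val df = spark.sql("SELECT CAST(null AS DOUBLE) AS comments")
df.write.mode("overwrite").parquet("/tmp/comments.parquet")

// Reading it back preserves the type; the value is simply null.
spark.read.parquet("/tmp/comments.parquet").printSchema()
// root
//  |-- comments: double (nullable = true)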

answered Sep 22 '22 by Alper t. Turker