Until recently, Parquet did not support null values - a questionable premise. In fact, a recent version did finally add that support:

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

However, it will be a long time before Spark supports that new Parquet feature - if ever. Here is the associated (closed - will not fix) JIRA:

https://issues.apache.org/jira/browse/SPARK-10943

So what are folks doing today about null column values when writing DataFrames out to Parquet? I can only think of very ugly, horrible hacks like writing empty strings and... well... I have no idea what to do with numerical values to indicate null - short of putting in some sentinel value and having my code check for it (which is inconvenient and bug-prone).
Before the write, the schema's nullability is enforced. But once the DataFrame has been written to Parquet and read back in, all column nullability flies out the window, as the output of printSchema() on the incoming DataFrame shows.
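A minimal sketch of that round trip, assuming a running SparkSession named spark; the /tmp path is purely illustrative:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Build a DataFrame whose column is explicitly non-nullable.
val schema = StructType(Seq(StructField("id", IntegerType, nullable = false)))
val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row(1), Row(2))), schema)
df.printSchema()  // |-- id: integer (nullable = false)

// After the round trip, Spark conservatively reports the column as nullable.
df.write.mode("overwrite").parquet("/tmp/nullability-demo")  // hypothetical path
spark.read.parquet("/tmp/nullability-demo").printSchema()
// |-- id: integer (nullable = true)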
To remove rows with NULL values in selected columns of a Spark DataFrame, use na.drop(cols: Seq[String]) or na.drop(cols: Array[String]), passing the names of the columns you want checked for NULL values.
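For example, a small sketch where the DataFrame and column names are invented:

import spark.implicits._

val people = Seq(("alice", Some(30)), ("bob", None), ("carol", Some(25))).toDF("name", "age")

// Drop any row where `age` is NULL; other columns are not checked.
people.na.drop(Seq("age")).show()
// +-----+---+
// | name|age|
// +-----+---+
// |alice| 30|
// |carol| 25|
// +-----+---+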
Spark SQL supports a null ordering specification in the ORDER BY clause: depending on the specification, Spark places all NULL values first or last in the sorted output.
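A quick sketch of the syntax; the scores view is invented for the example, and NULLS FIRST / NULLS LAST needs a reasonably recent Spark version (2.x+):

import spark.implicits._

Seq(Some(1), None, Some(3)).toDF("score").createOrReplaceTempView("scores")

// NULLs sort after 1 and 3 here; with NULLS FIRST they would lead.
spark.sql("SELECT score FROM scores ORDER BY score ASC NULLS LAST").show()

// The same ordering through the DataFrame API:
spark.table("scores").orderBy($"score".asc_nulls_last).show()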
In Scala, using null to represent nullable or missing values is an anti-pattern: use the type Option instead. Option forces you to deal with both the presence and the absence of an element, making your system safer by avoiding nasty NullPointerExceptions at runtime.
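A minimal illustration; the function and values here are made up:

def findEmail(user: String): Option[String] =
  if (user == "alice") Some("alice@example.com") else None

// The compiler makes the caller handle both cases explicitly.
findEmail("bob") match {
  case Some(email) => println(s"Found: $email")
  case None        => println("No email on file")
}

// Or collapse to a default value:
val email = findEmail("bob").getOrElse("unknown")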
You misinterpreted SPARK-10943. Spark does support writing null values to numeric columns. The problem is that null alone carries no type information at all:
scala> spark.sql("SELECT null as comments").printSchema root |-- comments: null (nullable = true)
As per a comment by Michael Armbrust, all you have to do is cast:
scala> spark.sql("""SELECT CAST(null as DOUBLE) AS comments""").printSchema root |-- comments: double (nullable = true)
and the result can be safely written to Parquet.
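The same idea works in the DataFrame API, as a sketch: a typed null literal built with lit(null).cast(...) carries the type information Parquet needs. The column name and output path below are illustrative:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.DoubleType

val withTypedNull = spark.range(3).withColumn("comments", lit(null).cast(DoubleType))
withTypedNull.printSchema()
// |-- comments: double (nullable = true)

// The typed null column now writes cleanly to Parquet.
withTypedNull.write.mode("overwrite").parquet("/tmp/typed-null-demo")  // hypothetical path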