Until recently, Parquet did not support null values - a questionable premise. In fact, a recent version did finally add that support:

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

However, it will be a long time before Spark supports that new Parquet feature - if ever. Here is the associated (closed - will not fix) JIRA:

https://issues.apache.org/jira/browse/SPARK-10943

So what are folks doing today about null column values when writing DataFrames out to Parquet? I can only think of very ugly, horrible hacks like writing empty strings and... well... I have no idea what to do with numerical values to indicate null - short of putting in some sentinel value and having my code check for it (which is inconvenient and bug-prone).
Before the write, the schema's nullability is enforced. But once the DataFrame has been written to Parquet and read back in, all column nullability flies out the window, as the output of printSchema() on the incoming DataFrame shows.
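A minimal sketch of that round trip, assuming a running SparkSession named spark; the /tmp path is purely illustrative:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Build a DataFrame whose column is explicitly non-nullable.
val schema = StructType(Seq(StructField("id", IntegerType, nullable = false)))
val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row(1), Row(2))), schema)
df.printSchema()  // |-- id: integer (nullable = false)

// After the round trip, Spark conservatively reports the column as nullable.
df.write.mode("overwrite").parquet("/tmp/nullability-demo")  // hypothetical path
spark.read.parquet("/tmp/nullability-demo").printSchema()
// |-- id: integer (nullable = true)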
To remove rows with NULL values in selected columns of a Spark DataFrame, use na.drop(cols: Seq[String]) or na.drop(cols: Array[String]), passing the names of the columns you want checked for NULL values.
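For example, a small sketch where the DataFrame and column names are invented:

import spark.implicits._

val people = Seq(("alice", Some(30)), ("bob", None), ("carol", Some(25))).toDF("name", "age")

// Drop any row where `age` is NULL; other columns are not checked.
people.na.drop(Seq("age")).show()
// +-----+---+
// | name|age|
// +-----+---+
// |alice| 30|
// |carol| 25|
// +-----+---+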
Spark SQL supports a null ordering specification in the ORDER BY clause: depending on the specification, Spark places all NULL values first or last in the sorted output.
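A quick sketch of the syntax; the scores view is invented for the example, and NULLS FIRST / NULLS LAST needs a reasonably recent Spark version (2.x+):

import spark.implicits._

Seq(Some(1), None, Some(3)).toDF("score").createOrReplaceTempView("scores")

// NULLs sort after 1 and 3 here; with NULLS FIRST they would lead.
spark.sql("SELECT score FROM scores ORDER BY score ASC NULLS LAST").show()

// The same ordering through the DataFrame API:
spark.table("scores").orderBy($"score".asc_nulls_last).show()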
In Scala, using null to represent nullable or missing values is an anti-pattern: use the type Option instead. Option forces you to deal with both the presence and the absence of an element, making your system safer by avoiding nasty NullPointerExceptions at runtime.
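A minimal illustration; the function and values here are made up:

def findEmail(user: String): Option[String] =
  if (user == "alice") Some("alice@example.com") else None

// The compiler makes the caller handle both cases explicitly.
findEmail("bob") match {
  case Some(email) => println(s"Found: $email")
  case None        => println("No email on file")
}

// Or collapse to a default value:
val email = findEmail("bob").getOrElse("unknown")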
You misinterpreted SPARK-10943. Spark does support writing null values to numeric columns. The problem is that null alone carries no type information at all:
scala> spark.sql("SELECT null as comments").printSchema root |-- comments: null (nullable = true)
As per a comment by Michael Armbrust, all you have to do is cast:
scala> spark.sql("""SELECT CAST(null as DOUBLE) AS comments""").printSchema root |-- comments: double (nullable = true)
and the result can be safely written to Parquet.
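The same idea works in the DataFrame API, as a sketch: a typed null literal built with lit(null).cast(...) carries the type information Parquet needs. The column name and output path below are illustrative:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.DoubleType

val withTypedNull = spark.range(3).withColumn("comments", lit(null).cast(DoubleType))
withTypedNull.printSchema()
// |-- comments: double (nullable = true)

// The typed null column now writes cleanly to Parquet.
withTypedNull.write.mode("overwrite").parquet("/tmp/typed-null-demo")  // hypothetical path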