
How to deal with Spark UDF input/output of primitive nullable type

The issues:

1) Spark doesn't call the UDF if the input is a column of primitive type that contains null:

inputDF.show()

+-----+
|  x  |
+-----+
| null|
|  1.0|
+-----+

inputDF
  .withColumn("y",
     udf { (x: Double) => 2.0 }.apply($"x") // will not be invoked if $"x" == null
  )
  .show()

+-----+-----+
|  x  |  y  |
+-----+-----+
| null| null|
|  1.0|  2.0|
+-----+-----+
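One way to see why Spark skips the call instead of passing null through (a sketch in plain Scala, outside Spark): unboxing a null reference into a primitive `Double` silently yields `0.0`, so a primitive parameter has no way to represent "missing" — Spark short-circuits to null rather than hand the UDF a bogus zero.

```scala
import scala.runtime.BoxesRunTime

// Unboxing a null reference to a primitive yields the zero value,
// not an error -- a primitive Double cannot encode "missing".
val unboxed: Double = BoxesRunTime.unboxToDouble(null)
println(unboxed) // 0.0
```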

2) You can't return null from a UDF as a column of primitive type:

udf { (x: String) => null: Double } // compile error
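The root cause in both directions is that Scala's `Double` is the JVM primitive, which has no null, while the Java boxed wrapper does. A minimal plain-Scala illustration (no Spark needed; `twice` is a hypothetical helper, not part of the question's code):

```scala
// val bad: Double = null          // does not compile: Null is not a Double
val boxed: java.lang.Double = null // the boxed wrapper can represent "missing"

// A function over boxed types can both receive and return null:
def twice(x: java.lang.Double): java.lang.Double =
  if (x == null) null else x * 2.0

println(twice(null)) // null
println(twice(3.0))  // 6.0
```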

asked Mar 14 '17 by Artur Rashitov




2 Answers

According to the docs:

Note that if you use primitive parameters, you are not able to check if it is null or not, and the UDF will return null for you if the primitive input is null. Use boxed type or [[Option]] if you wanna do the null-handling yourself.


So the easiest solution is to use boxed types whenever your UDF input is a nullable column of primitive type, and/or you need to output null from the UDF as a column of primitive type:

inputDF
  .withColumn("y",
     udf { (x: java.lang.Double) => 
       (if (x == null) 1 else null): java.lang.Integer
     }.apply($"x")
  )
  .show()

+-----+-----+
|  x  |  y  |
+-----+-----+
| null|    1|
|  1.0| null|
+-----+-----+
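The UDF body above can be exercised as a plain Scala function (a hypothetical helper mirroring it, no Spark required):

```scala
// Mirrors the UDF body: a boxed input lets us test for null,
// and a boxed output lets us return null.
def y(x: java.lang.Double): java.lang.Integer =
  if (x == null) 1 else null

println(y(null)) // 1
println(y(1.0))  // null
```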
answered Oct 16 '22 by Artur Rashitov

I would also use Artur's solution, but there is another way that avoids Java's wrapper classes, by wrapping the column in a struct:

import org.apache.spark.sql.functions.struct
import org.apache.spark.sql.Row

inputDF
  .withColumn("y",
     udf { (r: Row) => 
       if (r.isNullAt(0)) Some(1) else None
     }.apply(struct($"x"))
  )
  .show()
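The same null check can also be expressed with `Option`, which the quoted docs name as the other alternative to boxed types. A hypothetical plain-Scala mirror of the UDF body:

```scala
// Option-based mirror of the struct/Row UDF body:
// None on the way in models a null cell; None on the way out
// would become a null value in the output column.
def y(x: Option[Double]): Option[Int] =
  if (x.isEmpty) Some(1) else None

println(y(None))      // Some(1)
println(y(Some(1.0))) // None
```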
answered Oct 16 '22 by Raphael Roth