Given Table 1 with one column "x" of type String, I want to create Table 2 with a column "y" that is an integer representation of the date strings given in "x". It is essential to keep null values in column "y".
Table 1 (Dataframe df1):
+----------+
|         x|
+----------+
|2015-09-12|
|2015-09-13|
|      null|
|      null|
+----------+

root
 |-- x: string (nullable = true)
Table 2 (Dataframe df2):
+----------+--------+
|         x|       y|
+----------+--------+
|      null|    null|
|      null|    null|
|2015-09-12|20150912|
|2015-09-13|20150913|
+----------+--------+

root
 |-- x: string (nullable = true)
 |-- y: integer (nullable = true)
The user-defined function (udf) to convert values from column "x" into those of column "y" is:
val extractDateAsInt = udf[Int, String] (
  (d: String) => d.substring(0, 10)
    .filterNot("-".toSet)
    .toInt
)
It works, but dealing with null values is not possible.
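For illustration (a minimal sketch, assuming df1 from Table 1): applying the udf directly to the nullable column compiles fine, but fails at execution time, because substring is invoked on a null String:

// sketch: this compiles, but the job fails at runtime with a
// NullPointerException, since d.substring(0, 10) is called with d == null
df1.withColumn("y", extractDateAsInt(df1("x"))).show()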
Even though I can do something like
val extractDateAsIntWithNull = udf[Int, String] (
  (d: String) =>
    if (d != null) d.substring(0, 10).filterNot("-".toSet).toInt
    else 1
)
I have found no way to "produce" null values via udfs (of course, as Ints cannot be null).
My current solution for creating df2 (Table 2) is as follows:
// holds data of table 1
val df1 = ...

// filter entries from df1 that are not null and compute "y"
val dfNotNulls = df1.filter(df1("x").isNotNull)
  .withColumn("y", extractDateAsInt(df1("x")))
  .withColumnRenamed("x", "right_x")

// create df2 via a left outer join of df1 and dfNotNulls
val df2 = df1.join(
  dfNotNulls,
  df1("x") === dfNotNulls("right_x"),
  "leftouter"
).drop("right_x")
Question:

Is there a type NullableInt planned / available, such that the following udf is possible (see the code excerpt below)?
val extractDateAsNullableInt = udf[NullableInt, String] (
  (d: String) =>
    if (d != null) d.substring(0, 10).filterNot("-".toSet).toInt
    else null
)
This is where Option comes in handy:
val extractDateAsOptionInt = udf((d: String) => d match {
  case null => None
  case s => Some(s.substring(0, 10).filterNot("-".toSet).toInt)
})
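With the Option-returning udf, the whole transformation collapses to a single withColumn call. A minimal usage sketch (assuming df1 from Table 1 and spark.implicits._ in scope for the $ syntax):

// None is rendered as null, and the return type Option[Int] is mapped
// to a nullable integer column, matching the schema of Table 2
val df2 = df1.withColumn("y", extractDateAsOptionInt($"x"))
df2.printSchema()
// root
//  |-- x: string (nullable = true)
//  |-- y: integer (nullable = true)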
or, to make it slightly more secure in the general case:
import scala.util.Try

val extractDateAsOptionInt = udf((d: String) => Try(
  d.substring(0, 10).filterNot("-".toSet).toInt
).toOption)
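The Try variant is "more secure" because it turns any failure, not just nulls, into None; for instance a string too short for substring(0, 10) also ends up as null instead of crashing the job. A small illustrative sketch (the input values here are made up; assumes spark.implicits._ is in scope):

// "garbage" makes substring(0, 10) throw; Try converts that into None
Seq("2015-09-12", "garbage", null).toDF("x")
  .withColumn("y", extractDateAsOptionInt($"x"))
  .show()
// +----------+--------+
// |         x|       y|
// +----------+--------+
// |2015-09-12|20150912|
// |   garbage|    null|
// |      null|    null|
// +----------+--------+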
All credit goes to Dmitriy Selivanov, who pointed out this solution as a (missing?) edit here.
An alternative is to handle null outside the UDF:
import org.apache.spark.sql.functions.{lit, when}
import org.apache.spark.sql.types.IntegerType

val extractDateAsInt = udf(
  (d: String) => d.substring(0, 10).filterNot("-".toSet).toInt
)

df.withColumn("y",
  when($"x".isNull, lit(null))
    .otherwise(extractDateAsInt($"x"))
    .cast(IntegerType)
)
Scala actually has a nice factory function, Option(), that can make this even more concise:
val extractDateAsOptionInt = udf((d: String) =>
  Option(d).map(_.substring(0, 10).filterNot("-".toSet).toInt))
Internally the Option object's apply method is just doing the null check for you:
def apply[A](x: A): Option[A] = if (x == null) None else Some(x)
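So for a null input the map on the Option is simply skipped, which is what makes the one-liner safe. For example:

Option(null: String).map(_.length)  // None: nothing to map over
Option("2015-09-12").map(_.length)  // Some(10)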