Spark 2 Dataset Null value exception

I'm getting this null value error in Spark Dataset.filter.

Input CSV:

name,age,stat
abc,22,m
xyz,,s

Working code:

case class Person(name: String, age: Long, stat: String)

val peopleDS = spark.read.option("inferSchema","true")
  .option("header", "true").option("delimiter", ",")
  .csv("./people.csv").as[Person]
peopleDS.show()
peopleDS.createOrReplaceTempView("people")
spark.sql("select * from people where age > 30").show()

Failing code (adding the following lines returns an error):

val filteredDS = peopleDS.filter(_.age > 30)
filteredDS.show()

Returns this null value error:

java.lang.RuntimeException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "age")
- root class: "com.gcp.model.Person"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
asked Jan 15 '17 by xstack2000


1 Answer

The exception you get should explain everything, but let's go step-by-step:

  • When you load data using the csv data source, all fields are marked as nullable:

    val path: String = ???
    
    val peopleDF = spark.read
      .option("inferSchema","true")
      .option("header", "true")
      .option("delimiter", ",")
      .csv(path)
    
    peopleDF.printSchema
    
    root
    |-- name: string (nullable = true)
    |-- age: integer (nullable = true)
    |-- stat: string (nullable = true)
    
  • A missing field is represented as SQL NULL:

    peopleDF.where($"age".isNull).show
    
    +----+----+----+
    |name| age|stat|
    +----+----+----+
    | xyz|null|   s|
    +----+----+----+
    
  • Next you convert Dataset[Row] to Dataset[Person], which uses Long to encode the age field. Long in Scala cannot be null. Because the input schema is nullable, the output schema stays nullable despite that:

    val peopleDS = peopleDF.as[Person]
    
    peopleDS.printSchema
    
    root
     |-- name: string (nullable = true)
     |-- age: integer (nullable = true)
     |-- stat: string (nullable = true)
    

    Note that as[T] doesn't affect the schema at all.
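
    A quick way to see the mismatch (a small check, assuming Spark 2.x): compare the schema above with the one the Person encoder itself derives, where the primitive Long field is non-nullable:

    import org.apache.spark.sql.Encoders

    // The Person encoder uses a non-nullable long for age, unlike the
    // nullable schema inherited from the CSV read:
    Encoders.product[Person].schema.printTreeString()

    root
     |-- name: string (nullable = true)
     |-- age: long (nullable = false)
     |-- stat: string (nullable = true)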

  • When you query a Dataset using SQL (on the registered table) or the DataFrame API, Spark won't deserialize the objects. Since the schema is still nullable, we can execute:

    peopleDS.where($"age" > 30).show
    
    +----+---+----+
    |name|age|stat|
    +----+---+----+
    +----+---+----+
    

    without any issues. This is just plain SQL logic, and NULL is a valid value.
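
    To see that three-valued logic directly, evaluate the predicate as a column on the same data; the NULL age yields NULL, which WHERE treats as "not true" and drops:

    peopleDS.select($"name", ($"age" > 30).as("age_gt_30")).show

    +----+---------+
    |name|age_gt_30|
    +----+---------+
    | abc|    false|
    | xyz|     null|
    +----+---------+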

  • When we use the statically typed Dataset API:

    peopleDS.filter(_.age > 30)
    

    Spark has to deserialize the objects. Because Long cannot be null (SQL NULL), it fails with the exception you've seen.

    If it weren't for that check, you'd get an NPE.
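
    A plain Scala illustration of that NPE (an illustrative snippet, not Spark's actual generated code): unboxing a null boxed Long throws, which is why Spark adds the explicit null check and raises the clearer RuntimeException instead:

    val boxed: java.lang.Long = null
    val unboxed: Long = boxed   // unboxing null throws java.lang.NullPointerException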

  • The correct statically typed representation of your data should use Option types:

    case class Person(name: String, age: Option[Long], stat: String)
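
    To pick up the new encoder, re-create the Dataset from the same read as in the question:

    val peopleDS = spark.read
      .option("inferSchema", "true")
      .option("header", "true")
      .option("delimiter", ",")
      .csv("./people.csv")
      .as[Person]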
    

    The adjusted filter function then becomes:

    peopleDS.filter(_.age.map(_ > 30).getOrElse(false))
    
    +----+---+----+
    |name|age|stat|
    +----+---+----+
    +----+---+----+
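
    Equivalently, Option.exists expresses the same predicate a bit more concisely (a stylistic alternative):

    peopleDS.filter(_.age.exists(_ > 30))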
    

    If you prefer, you can use pattern matching:

    peopleDS.filter {
      case Person(_, Some(age), _) => age > 30
      case _                       => false   // age is None
    }
    

    Note that you don't have to (though it would be recommended anyway) use Option types for name and stat. Because Scala String is just a Java String, it can be null. Of course, if you go with this approach, you have to explicitly check whether the accessed values are null or not.
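
    For example, a minimal sketch of such an explicit check, wrapping the possibly-null String in Option before calling any of its methods:

    // Guard against a null name before dereferencing it:
    peopleDS.filter(p => Option(p.name).exists(_.startsWith("a")))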

Related: Spark 2.0 Dataset vs DataFrame

answered Oct 17 '22 by zero323