 

Comparison of a `float` to `np.nan` in Spark Dataframe

Is this expected behaviour? I thought about raising an issue with Spark, but this seems like such basic functionality that it's hard to imagine there's a bug here. What am I missing?

Python

>>> import numpy as np

>>> np.nan < 0.0
False

>>> np.nan > 0.0
False

PySpark

from pyspark.sql.functions import col

df = spark.createDataFrame([(np.nan, 0.0), (0.0, np.nan)])
df.show()
#+---+---+
#| _1| _2|
#+---+---+
#|NaN|0.0|
#|0.0|NaN|
#+---+---+

df.printSchema()
#root
# |-- _1: double (nullable = true)
# |-- _2: double (nullable = true)

df.select(col("_1") > col("_2")).show()
#+---------+
#|(_1 > _2)|
#+---------+
#|     true|
#|    false|
#+---------+
asked Mar 18 '19 by avloss



1 Answer

That is both expected and documented behavior. To quote the NaN Semantics section of the official Spark SQL Guide:

There is special handling for not-a-number (NaN) when dealing with float or double types that does not exactly match standard floating point semantics. Specifically:

  • NaN = NaN returns true.
  • In aggregations, all NaN values are grouped together.
  • NaN is treated as a normal value in join keys.
  • NaN values go last when in ascending order, larger than any other numeric value.
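
The last point is easy to observe directly. A minimal sketch, reusing `df` from the question, with the expected output shown as comments in the question's style:

df.orderBy("_1").show()
#+---+---+
#| _1| _2|
#+---+---+
#|0.0|NaN|
#|NaN|0.0|
#+---+---+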

As you can see, ordering behavior is not the only difference compared to Python's NaN. In particular, Spark considers NaNs equal:

spark.sql("""
    WITH table AS (SELECT CAST('NaN' AS float) AS x, CAST('NaN' AS float) AS y)
    SELECT x = y, x != y FROM table
""").show()
+-------+-------------+
|(x = y)|(NOT (x = y))|
+-------+-------------+
|   true|        false|
+-------+-------------+

while plain Python

float("NaN") == float("NaN"), float("NaN") != float("NaN")
(False, True)

and NumPy

np.nan == np.nan, np.nan != np.nan
(False, True)

don't.
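
The same semantics surface through the DataFrame API. As a minimal sketch, again reusing `df` from the question, comparing the `_1` column to itself returns true even for the NaN row, which plain Python equality would not:

df.select((col("_1") == col("_1")).alias("(_1 = _1)")).show()
#+---------+
#|(_1 = _1)|
#+---------+
#|     true|
#|     true|
#+---------+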

You can check the `eqNullSafe` docstring for additional examples.
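
As an illustrative sketch along those lines (the DataFrame below is made up for this example, not taken from the docstring): null-safe equality treats two NaNs as equal while still distinguishing NaN from NULL.

from pyspark.sql import Row

# Illustrative DataFrame: one NaN, one ordinary double, one NULL
df2 = spark.createDataFrame([
    Row(value=float("NaN")),
    Row(value=42.0),
    Row(value=None),
])
df2.select(
    df2["value"].eqNullSafe(float("NaN")).alias("value <=> NaN"),
    df2["value"].eqNullSafe(None).alias("value <=> NULL"),
).show()
#+-------------+--------------+
#|value <=> NaN|value <=> NULL|
#+-------------+--------------+
#|         true|         false|
#|        false|         false|
#|        false|          true|
#+-------------+--------------+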

So to get the desired result you'll have to check explicitly for NaNs:

from pyspark.sql.functions import col, isnan, when

when(isnan("_1") | isnan("_2"), False).otherwise(col("_1") > col("_2"))
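
For example, applied to the DataFrame from the question (alias added for readability, expected output shown as comments):

df.select(
    when(isnan("_1") | isnan("_2"), False)
    .otherwise(col("_1") > col("_2"))
    .alias("(_1 > _2)")
).show()
#+---------+
#|(_1 > _2)|
#+---------+
#|    false|
#|    false|
#+---------+

which matches the Python and NumPy semantics from the question.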

answered Sep 18 '22 by user10938362