 

Comparison of a `float` to `np.nan` in Spark Dataframe

Is this expected behaviour? I thought about raising an issue with Spark, but this seems like such basic functionality that it's hard to imagine there's a bug here. What am I missing?

Python

>>> import numpy as np

>>> np.nan < 0.0
False

>>> np.nan > 0.0
False

PySpark

from pyspark.sql.functions import col

df = spark.createDataFrame([(np.nan, 0.0), (0.0, np.nan)])
df.show()
#+---+---+
#| _1| _2|
#+---+---+
#|NaN|0.0|
#|0.0|NaN|
#+---+---+

df.printSchema()
#root
# |-- _1: double (nullable = true)
# |-- _2: double (nullable = true)

df.select(col("_1") > col("_2")).show()
#+---------+
#|(_1 > _2)|
#+---------+
#|     true|
#|    false|
#+---------+
asked Mar 18 '19 by avloss



1 Answer

That is both expected and documented behavior. To quote the NaN Semantics section of the official Spark SQL Guide:

There is special handling for not-a-number (NaN) when dealing with float or double types that does not exactly match standard floating point semantics. Specifically:

  • NaN = NaN returns true.
  • In aggregations, all NaN values are grouped together.
  • NaN is treated as a normal value in join keys.
  • NaN values go last when in ascending order, larger than any other numeric value.
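
The last point is easy to observe directly. A minimal sketch, reusing `df` from the question, with the expected output shown as comments in the question's style:

df.orderBy("_1").show()
#+---+---+
#| _1| _2|
#+---+---+
#|0.0|NaN|
#|NaN|0.0|
#+---+---+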

As you can see, ordering behavior is not the only difference compared to Python's NaN. In particular, Spark considers NaNs equal:

spark.sql("""
    WITH table AS (SELECT CAST('NaN' AS float) AS x, CAST('NaN' AS float) AS y)
    SELECT x = y, x != y FROM table
""").show()
+-------+-------------+
|(x = y)|(NOT (x = y))|
+-------+-------------+
|   true|        false|
+-------+-------------+

while plain Python

float("NaN") == float("NaN"), float("NaN") != float("NaN")
(False, True)

and NumPy

np.nan == np.nan, np.nan != np.nan
(False, True)

don't.
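
The same semantics surface through the DataFrame API. As a minimal sketch, again reusing `df` from the question, comparing the `_1` column to itself returns true even for the NaN row, which plain Python equality would not:

df.select((col("_1") == col("_1")).alias("(_1 = _1)")).show()
#+---------+
#|(_1 = _1)|
#+---------+
#|     true|
#|     true|
#+---------+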

You can check the `eqNullSafe` docstring for additional examples.
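
As an illustrative sketch along those lines (the DataFrame below is made up for this example, not taken from the docstring): null-safe equality treats two NaNs as equal while still distinguishing NaN from NULL.

from pyspark.sql import Row

# Illustrative DataFrame: one NaN, one ordinary double, one NULL
df2 = spark.createDataFrame([
    Row(value=float("NaN")),
    Row(value=42.0),
    Row(value=None),
])
df2.select(
    df2["value"].eqNullSafe(float("NaN")).alias("value <=> NaN"),
    df2["value"].eqNullSafe(None).alias("value <=> NULL"),
).show()
#+-------------+--------------+
#|value <=> NaN|value <=> NULL|
#+-------------+--------------+
#|         true|         false|
#|        false|         false|
#|        false|          true|
#+-------------+--------------+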

So to get the desired result you'll have to check explicitly for NaNs:

from pyspark.sql.functions import col, isnan, when

when(isnan("_1") | isnan("_2"), False).otherwise(col("_1") > col("_2"))
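
For example, applied to the DataFrame from the question (alias added for readability, expected output shown as comments):

df.select(
    when(isnan("_1") | isnan("_2"), False)
    .otherwise(col("_1") > col("_2"))
    .alias("(_1 > _2)")
).show()
#+---------+
#|(_1 > _2)|
#+---------+
#|    false|
#|    false|
#+---------+

which matches the Python and NumPy semantics from the question.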

answered Sep 18 '22 by user10938362