Is this expected behaviour? I thought about raising an issue with Spark, but this is such basic functionality that it's hard to imagine there's a bug here. What am I missing?
Python
>>> import numpy as np
>>> np.nan < 0.0
False
>>> np.nan > 0.0
False
PySpark
from pyspark.sql.functions import col
df = spark.createDataFrame([(np.nan, 0.0),(0.0, np.nan)])
df.show()
#+---+---+
#| _1| _2|
#+---+---+
#|NaN|0.0|
#|0.0|NaN|
#+---+---+
df.printSchema()
#root
# |-- _1: double (nullable = true)
# |-- _2: double (nullable = true)
df.select(col("_1") > col("_2")).show()
#+---------+
#|(_1 > _2)|
#+---------+
#| true|
#| false|
#+---------+
That is both expected and documented behavior. To quote the NaN Semantics section of the official Spark SQL Guide:
There is special handling for not-a-number (NaN) when dealing with float or double types that does not exactly match standard floating point semantics. Specifically:
- NaN = NaN returns true.
- In aggregations, all NaN values are grouped together.
- NaN is treated as a normal value in join keys.
- NaN values go last when in ascending order, larger than any other numeric value.
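For illustration, the last two rules can be checked directly. This is a quick sketch reusing the df from the question plus a small throwaway DataFrame (nan_df is just an illustrative name); exact show() formatting and group order may vary:
df.orderBy("_1").show()
#+---+---+
#| _1| _2|
#+---+---+
#|0.0|NaN|
#|NaN|0.0|
#+---+---+
nan_df = spark.createDataFrame([(float("nan"),), (float("nan"),), (1.0,)], "x double")
nan_df.groupBy("x").count().show()
#+---+-----+
#|  x|count|
#+---+-----+
#|NaN|    2|
#|1.0|    1|
#+---+-----+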
As you can see, ordering behavior is not the only difference compared to Python NaN. In particular, Spark considers NaNs equal:
spark.sql("""
WITH table AS (SELECT CAST('NaN' AS float) AS x, cast('NaN' AS float) AS y)
SELECT x = y, x != y FROM table
""").show()
+-------+-------------+
|(x = y)|(NOT (x = y))|
+-------+-------------+
| true| false|
+-------+-------------+
while plain Python
float("NaN") == float("NaN"), float("NaN") != float("NaN")
(False, True)
and NumPy
np.nan == np.nan, np.nan != np.nan
(False, True)
don't.
You can check the eqNullSafe docstring for additional examples.
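For instance, a minimal sketch of the NULL-safe equality operator (the DataFrame and column names here are made up purely for illustration; output formatting may differ):
df2 = spark.createDataFrame([(float("nan"), float("nan")), (None, None), (1.0, None)], "a double, b double")
df2.select(col("a").eqNullSafe(col("b"))).show()
#+---------+
#|(a <=> b)|
#+---------+
#|     true|
#|     true|
#|    false|
#+---------+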
So to get the desired result you'll have to explicitly check for NaNs:
from pyspark.sql.functions import col, isnan, when
when(isnan("_1") | isnan("_2"), False).otherwise(col("_1") > col("_2"))
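Applied to the DataFrame from the question, both rows contain a NaN, so the comparison comes out false (a sketch; the alias and exact show() formatting are just for readability):
df.select(
    when(isnan("_1") | isnan("_2"), False)
    .otherwise(col("_1") > col("_2"))
    .alias("_1 > _2")
).show()
#+-------+
#|_1 > _2|
#+-------+
#|  false|
#|  false|
#+-------+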