The toPandas method in PySpark is not consistent in how it represents null values in numerical columns. Is there a way to force it to be consistent?
An example
sc is the SparkSession. The Spark version is 2.3.2. I'm not sure how to include notebook results, so I'll just comment the outputs. It's pretty straightforward, and you can check it yourself in a notebook.
import numpy as np
from pyspark.sql.functions import col

sparkTest = sc.createDataFrame(
[
(1, 1),
(2, None),
(None, None),
],
['a', 'b']
)
sparkTest.show() # all None values are neatly converted to null
pdTest1 = sparkTest.toPandas()
pdTest1 # all None values are NaN
np.isnan(pdTest1['b']) # this is a Series of dtype bool
pdTest2 = sparkTest.filter(col('b').isNull()).toPandas()
pdTest2 # the null value in column a is still NaN, but the two nulls in column b are now None
np.isnan(pdTest2['b']) # this throws an error
This is of course problematic when programming, since you cannot predict beforehand whether a column will be all nulls.
Incidentally, I wanted to report this as an issue, but I'm not sure where. The GitHub page doesn't seem to have an issues section?
np.isnan can be applied to NumPy arrays of a native numeric dtype (such as np.float64), but it raises a TypeError when applied to object arrays:
pdTest1['b']
0 1.0
1 NaN
2 NaN
Name: b, dtype: float64
pdTest2['b']
0 None
1 None
Name: b, dtype: object
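For reference, here is a minimal sketch (plain pandas/NumPy, no Spark involved) that reproduces the TypeError with an all-None column; the variable names are just for illustration:
import numpy as np
import pandas as pd

s = pd.Series([None, None])   # an all-None column is inferred as dtype=object
print(s.dtype)                # object
try:
    np.isnan(s)               # the isnan ufunc is not defined for object arrays
except TypeError as err:
    print('TypeError:', err)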
If you have pandas, you could use pandas.isnull instead:
import pandas as pd
pd.isnull(pdTest1['b'])
0 False
1 True
2 True
Name: b, dtype: bool
pd.isnull(pdTest2['b'])
0 True
1 True
Name: b, dtype: bool
This is consistent for both np.nan and None.
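As a quick illustration (a minimal sketch, independent of the Spark example), pd.isnull flags both kinds of missing value the same way, even in an object-dtype Series:
mixed = pd.Series([1.0, np.nan, None], dtype=object)  # dtype=object keeps None from being coerced to NaN
pd.isnull(mixed)  # 0 False, 1 True, 2 True -- NaN and None are both reported as missing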
Alternatively, if your data allows it, you could cast pdTest2['b'] to a native NumPy dtype (such as np.float64) so that np.isnan works:
pdTest2 = sparkTest.filter(col('b').isNull()).toPandas()
np.isnan(pdTest2['b'].astype(np.float64))
0 True
1 True
Name: b, dtype: bool
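If you want to apply this more generally, one possible approach (a sketch, assuming you know in advance which columns are supposed to be numeric) is to cast those columns to float64 right after toPandas, so missing values always come out as np.nan no matter how many nulls a column contains:
numeric_cols = ['a', 'b']                                          # hypothetical: the columns that should be numeric
pdTest2 = sparkTest.filter(col('b').isNull()).toPandas()
pdTest2[numeric_cols] = pdTest2[numeric_cols].astype(np.float64)   # None -> np.nan, dtype becomes float64
np.isnan(pdTest2['b'])                                             # now works even for all-null columns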