
Force consistent conversion of null to nan when using toPandas

The toPandas method in pyspark is not consistent for null values in numerical columns. Is there a way to force it to be more consistent?

An example

spark is the SparkSession. The Spark version is 2.3.2. I'm not sure how to include notebook results, so I'll just note the outputs in comments. It's pretty straightforward, and you can check it yourself in a notebook.

import numpy as np
from pyspark.sql.functions import col

sparkTest = spark.createDataFrame(
    [
        (1,    1   ),
        (2,    None),
        (None, None),
    ],
    ['a', 'b']
)
sparkTest.show() # all None values are neatly converted to null

pdTest1 = sparkTest.toPandas()
pdTest1 # all None values are NaN
np.isnan(pdTest1['b']) # this is a Series of dtype bool

pdTest2 = sparkTest.filter(col('b').isNull()).toPandas()
pdTest2 # the null value in column a is still NaN, but the two nulls in column b are now None
np.isnan(pdTest2['b']) # this throws an error

This is of course problematic when programming, because you can't predict beforehand whether a column will contain only nulls.

Incidentally, I wanted to report this as an issue, but I'm not sure where. The GitHub page doesn't seem to have an issues section.

Willem asked Dec 07 '25 08:12


1 Answer

np.isnan can be applied to NumPy arrays of native dtype (such as np.float64), but raises TypeError when applied to object arrays:

pdTest1['b']
0    1.0
1    NaN
2    NaN
Name: b, dtype: float64

pdTest2['b']
0    None
1    None
Name: b, dtype: object
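
The dtype difference can be reproduced without Spark. The following is a minimal sketch, assuming only that NumPy and pandas are installed; the names float_col and object_col are illustrative stand-ins for pdTest1['b'] and pdTest2['b'], not part of the original example:

```python
import numpy as np
import pandas as pd

# Stand-in for pdTest1['b']: numbers mixed with missing values become float64.
float_col = pd.Series([1.0, None, None], name="b")

# Stand-in for pdTest2['b']: a column holding only None, kept as object dtype.
object_col = pd.Series([None, None], dtype=object, name="b")

print(float_col.dtype)
print(object_col.dtype)

# np.isnan works on the float64 column ...
float_mask = np.isnan(float_col)

# ... but raises TypeError on the object column.
try:
    np.isnan(object_col)
    raised = False
except TypeError:
    raised = True
print(raised)
```

So the error depends on the dtype of the column, not on the values themselves.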

If you have pandas, you could use pandas.isnull instead:

import pandas as pd


pd.isnull(pdTest1['b'])
0    False
1     True
2     True
Name: b, dtype: bool


pd.isnull(pdTest2['b'])
0    True
1    True
Name: b, dtype: bool

This gives consistent results for both np.nan and None.

Alternatively, if your data allows it, you could cast pdTest2['b'] to one of the native NumPy types (such as np.float64) so that np.isnan works:

import numpy as np
import pyspark.sql.functions as f

pdTest2 = sparkTest.filter(f.col('b').isNull()).toPandas()
np.isnan(pdTest2['b'].astype(np.float64))
0    True
1    True
Name: b, dtype: bool
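
If the cast is needed for several columns, it could be wrapped in a small helper. This is only a sketch: nulls_to_nan is a hypothetical function, not a PySpark or pandas API, and it assumes you already know which columns are numeric, since an all-None object column carries no type information. The frame pd_test2 below is a pandas-only stand-in for the filtered pdTest2.

```python
import numpy as np
import pandas as pd


def nulls_to_nan(pdf, columns):
    """Return a copy of pdf with the given columns cast to float64.

    Hypothetical helper: `columns` must list columns known to be numeric.
    """
    out = pdf.copy()
    for name in columns:
        out[name] = out[name].astype(np.float64)  # None becomes NaN
    return out


# Stand-in for the filtered pdTest2: 'a' is float64, 'b' is object.
pd_test2 = pd.DataFrame({
    "a": [2.0, np.nan],
    "b": pd.Series([None, None], dtype=object),
})

fixed = nulls_to_nan(pd_test2, ["a", "b"])
print(fixed.dtypes)
print(np.isnan(fixed["b"]))
```

Note that casting to float64 can lose precision for very large integers, so this only fits columns where a float representation is acceptable.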

Napoleon Borntoparty answered Dec 09 '25 21:12