The toPandas method in PySpark is not consistent in how it represents null values in numerical columns. Is there a way to force it to be consistent?
An example
sc is the SparkSession. The Spark version is 2.3.2. I'm not sure how to include notebook results, so I'll just comment the outputs. It's pretty straightforward, and you can check it yourself in a notebook.
import numpy as np
from pyspark.sql.functions import col

sparkTest = sc.createDataFrame(
[
(1, 1),
(2, None),
(None, None),
],
['a', 'b']
)
sparkTest.show() # all None values are neatly converted to null
pdTest1 = sparkTest.toPandas()
pdTest1 # all None values are NaN
np.isnan(pdTest1['b']) # this is a Series of dtype bool
pdTest2 = sparkTest.filter(col('b').isNull()).toPandas()
pdTest2 # the null value in column a is still NaN, but the two nulls in column b are now None
np.isnan(pdTest2['b']) # this throws an error
This is of course problematic when programming, since you cannot predict beforehand whether a column will be all nulls.
Incidentally, I wanted to report this as an issue, but I'm not sure where. The GitHub page doesn't seem to have an issues section?
np.isnan can be applied to NumPy arrays of a native numeric dtype (such as np.float64), but it raises a TypeError when applied to object arrays:
pdTest1['b']
0 1.0
1 NaN
2 NaN
Name: b, dtype: float64
pdTest2['b']
0 None
1 None
Name: b, dtype: object
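For reference, here is a minimal sketch (plain pandas/NumPy, no Spark involved) that reproduces the TypeError with an all-None column; the variable names are just for illustration:
import numpy as np
import pandas as pd

s = pd.Series([None, None])   # an all-None column is inferred as dtype=object
print(s.dtype)                # object
try:
    np.isnan(s)               # the isnan ufunc is not defined for object arrays
except TypeError as err:
    print('TypeError:', err)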
If you have pandas, you could use pandas.isnull instead:
import pandas as pd
pd.isnull(pdTest1['b'])
0 False
1 True
2 True
Name: b, dtype: bool
pd.isnull(pdTest2['b'])
0 True
1 True
Name: b, dtype: bool
This is consistent for both np.nan and None.
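As a quick illustration (a minimal sketch, independent of the Spark example), pd.isnull flags both kinds of missing value the same way, even in an object-dtype Series:
mixed = pd.Series([1.0, np.nan, None], dtype=object)  # dtype=object keeps None from being coerced to NaN
pd.isnull(mixed)  # 0 False, 1 True, 2 True -- NaN and None are both reported as missing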
Alternatively, if your data allows it, you could cast pdTest2['b'] to a native NumPy dtype (such as np.float64) so that np.isnan works:
pdTest2 = sparkTest.filter(col('b').isNull()).toPandas()
np.isnan(pdTest2['b'].astype(np.float64))
0 True
1 True
Name: b, dtype: bool
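If you want to apply this more generally, one possible approach (a sketch, assuming you know in advance which columns are supposed to be numeric) is to cast those columns to float64 right after toPandas, so missing values always come out as np.nan no matter how many nulls a column contains:
numeric_cols = ['a', 'b']                                          # hypothetical: the columns that should be numeric
pdTest2 = sparkTest.filter(col('b').isNull()).toPandas()
pdTest2[numeric_cols] = pdTest2[numeric_cols].astype(np.float64)   # None -> np.nan, dtype becomes float64
np.isnan(pdTest2['b'])                                             # now works even for all-null columns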