
Compare Series containing None

Tags:

python

pandas

I am using the pandas shift function to check whether a value in a Series is equal to the previous value. Basically:

import pandas as pd

a = pd.Series([2, 2, 4, 5])

a == a.shift()
Out[1]: 
0    False
1     True
2    False
3    False
dtype: bool

This is as expected. (The first comparison is False because we are comparing with the NaN of the shifted series.) Now, I also have Series with missing values, i.e. None, like this:

b = pd.Series([None, None, 4, 5])

Here the comparison of the two Nones gives False

b == b.shift()
Out[3]: 
0    False
1    False
2    False
3    False
dtype: bool

I'd be willing to accept some sort of philosophical reasoning arguing that comparing None is meaningless etc., however

c = None
d = None
c == d
Out[4]: True

What is going on here?!

And what I really want to know is: how can I perform the comparison on my b Series so that it treats Nones as equal? That is, I want b == b.shift() to give the same result as a == a.shift() gave.

mortysporty asked Aug 17 '17 14:08


1 Answer

The None values get cast to NaN, and NaN has the property that it is not equal to itself:

In[54]:
b = pd.Series([None, None, 4, 5])
b

Out[54]: 
0    NaN
1    NaN
2    4.0
3    5.0
dtype: float64

As you can see here:

In[55]:
b==b

Out[55]: 
0    False
1    False
2     True
3     True
dtype: bool
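
As a quick check of that property, here is a minimal sketch (the variable names are only for illustration). It contrasts plain Python None, which compares equal to itself, with NaN, which does not, and shows isna() as the usual way to detect missing values instead of ==:

```python
import pandas as pd

# Plain Python: None is a singleton and compares equal to itself.
c = None
d = None
none_equal = (c == d)          # True

# IEEE 754 floats: NaN never compares equal to anything, itself included.
nan = float("nan")
nan_equal = (nan == nan)       # False

# In a numeric Series the Nones are stored as NaN, so == inherits that rule.
b = pd.Series([None, None, 4, 5])
self_equal = (b == b)          # False wherever b is NaN

# isna() is the reliable way to ask "is this value missing?"
missing = b.isna()
```

So the surprise in the question comes from the conversion to NaN, not from how Python compares None.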

I'm not sure of the cleanest way to make this work, but the following expression gives the desired result for this Series:

In[68]:
( (b == b.shift())  | ( (b != b.shift()) &  (b != b) ) )

Out[68]: 
0     True
1     True
2    False
3    False
dtype: bool

The result for the first row is spurious, because when you shift down the first row is compared against a non-existent row:

In[69]:
b.shift()

Out[69]: 
0    NaN
1    NaN
2    NaN
3    4.0
dtype: float64

So the boolean logic evaluates to True for the first row, because both the first row of b and the first row of the shifted Series are NaN.

To work around the false positive in the first row, you could slice the result to ignore that row:

In[70]:
( (b == b.shift())  | ( (b != b.shift()) &  (b != b) ) )[1:]

Out[70]: 
1     True
2    False
3    False
dtype: bool

As to why it gets cast: pandas tries to coerce the data to a compatible NumPy dtype. Here float64 is selected because the Series mixes ints and None values, and a missing value (NaN) cannot be represented in an integer dtype.
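
The coercion can be seen by inspecting the dtypes directly. This is only a small sketch using the two Series from the question; it checks the dtype kind rather than an exact dtype name:

```python
import pandas as pd

a = pd.Series([2, 2, 4, 5])        # only ints: stays an integer dtype
b = pd.Series([None, None, 4, 5])  # ints plus None: upcast to float

a_kind = a.dtype.kind              # "i" for a signed integer dtype
b_kind = b.dtype.kind              # "f" for a floating-point dtype

# The Nones are no longer None objects inside b; they are stored as NaN.
b_missing = b.isna()
print(a.dtype, b.dtype)
```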

To get the same result as for a in your example, you should overwrite the first row with False, since it has no previous value to compare against and should always fail:

In[78]:
result = pd.Series( ( (b == b.shift())  | ( (b != b.shift()) &  (b != b) ) ) )
result.iloc[0] = False
result

Out[78]: 
0    False
1     True
2    False
3    False
dtype: bool
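
An alternative that spells out the intent a bit more explicitly is to treat two values as equal when they compare equal or when both are missing. This is a sketch, not the only way to do it; the helper name equals_previous is made up for illustration, and it assumes you want the first row to be False, as above. Unlike the expression with b != b, it does not report True when a NaN follows a real value:

```python
import pandas as pd

def equals_previous(s: pd.Series) -> pd.Series:
    """Hypothetical helper: True where a value equals the previous one,
    counting two missing values (NaN/None) as equal."""
    prev = s.shift()
    # Equal in the ordinary sense, or missing on both sides.
    same = (s == prev) | (s.isna() & prev.isna())
    if len(same) > 0:
        # The first row has no predecessor, so it can never match.
        same.iloc[0] = False
    return same

a = pd.Series([2, 2, 4, 5])
b = pd.Series([None, None, 4, 5])
e = pd.Series([4, None, None, 5])  # a NaN following a real value

result_a = equals_previous(a)
result_b = equals_previous(b)
result_e = equals_previous(e)
print(result_b)
```

With the Series from the question, result_a and result_b both come out as False, True, False, False.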

EdChum answered Sep 30 '22 07:09