In pandas (master branch or the upcoming 0.14), how can I find the indices where my DataFrame is null?
When I do:
df.isnull()
I get a boolean DataFrame of the same size as df.
If I do:
df.isnull().index
I get the index of the original df.
What I want are the indices of the rows with NaN entries (either in some column, or in all columns).
df.index[df.isnull().any(axis=1)]
The .any(axis=1) gives you True/False for each row, indicating whether it contains at least one NaN value. With this boolean mask you can index into df.index to find the labels of the rows where df is null.
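As a quick sketch (column names and values here are illustrative, not from the question), using .any(axis=1) for rows with at least one NaN and .all(axis=1) for rows that are entirely NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical example frame
df = pd.DataFrame({"a": [np.nan, 0, 1], "b": [1, np.nan, 2]})

# Index labels of rows with a NaN in at least one column
some_nan = df.index[df.isnull().any(axis=1)]

# Index labels of rows where every column is NaN
all_nan = df.index[df.isnull().all(axis=1)]

print(list(some_nan))  # rows 0 and 1 each contain a NaN
print(list(all_nan))   # no row here is entirely NaN
```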
I would drop to numpy to do this a little faster:
In [11]: df = pd.DataFrame([[np.nan, 1], [0, np.nan], [1, 2]])
In [12]: df
Out[12]:
    0   1
0 NaN   1
1   0 NaN
2   1   2
In [13]: pd.isnull(df.values)
Out[13]:
array([[ True, False],
       [False,  True],
       [False, False]], dtype=bool)
In [14]: pd.isnull(df.values).any(1)
Out[14]: array([ True, True, False], dtype=bool)
In [15]: np.nonzero(pd.isnull(df.values).any(1))
Out[15]: (array([0, 1]),)
In [16]: df.index[np.nonzero(pd.isnull(df.values).any(1))]
Out[16]: Int64Index([0, 1], dtype='int64')
To see some timings, with a slightly larger df:
In [21]: df = pd.DataFrame([[np.nan, 1], [0, np.nan], [1, 2]] * 1000)
In [22]: %timeit np.nonzero(pd.isnull(df.values).any(1))
10000 loops, best of 3: 85.8 µs per loop
In [23]: %timeit df.index[df.isnull().any(1)]
1000 loops, best of 3: 629 µs per loop
and if you cared about the index (rather than the position):
In [24]: %timeit df.index[np.nonzero(pd.isnull(df.values).any(1))]
10000 loops, best of 3: 172 µs per loop
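Note that np.nonzero returns integer positions, not index labels, so the two only coincide with a default RangeIndex. A sketch showing the difference with a non-default index (the string labels are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 1], [0, np.nan], [1, 2]], index=["x", "y", "z"])

mask = pd.isnull(df.values).any(1)
positions = np.nonzero(mask)[0]   # integer positions into the frame
labels = df.index[positions]      # map positions back to index labels

print(positions.tolist())  # [0, 1]
print(list(labels))        # ['x', 'y']
```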