
Find indices where df is null

Tags: python, pandas

In pandas (master branch or upcoming 0.14) how can I find the indices where my dataframe is null?

When I do:

df.isnull()

I get a boolean DataFrame of the same size as df.

If I do:

df.isnull().index

I get the index of the original df.

What I want are the indices of the rows with NaN entries (either in some column, or in all columns).

Amelio Vazquez-Reina asked May 23 '14 20:05


2 Answers

df.index[df.isnull().any(axis=1)]

The .any(axis=1) call returns True or False for each row, depending on whether that row contains at least one NaN value. Using that boolean Series to index df.index then gives the index labels of the rows where df is null.
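The question asks about NaN "either on some column, or on all columns", so here is a minimal sketch of both cases. The frame, its column names, and its index labels are made up for illustration and are not from the question. `.any(axis=1)` flags rows with at least one NaN; swapping in `.all(axis=1)` flags only rows where every column is NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; labels "a".."d" are chosen for illustration only.
df = pd.DataFrame(
    {"x": [np.nan, 0.0, 1.0, np.nan], "y": [1.0, np.nan, 2.0, np.nan]},
    index=["a", "b", "c", "d"],
)

mask = df.isnull()  # boolean DataFrame, same shape as df

# Index labels of rows with at least one NaN
rows_any = df.index[mask.any(axis=1)].tolist()

# Index labels of rows where every column is NaN
rows_all = df.index[mask.all(axis=1)].tolist()

print(rows_any)
print(rows_all)
```

With this data, rows "a", "b" and "d" each contain a NaN, while only row "d" is NaN in every column.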

joris answered Nov 06 '22 01:11


I would drop to numpy to do this a little faster:

In [11]: df = pd.DataFrame([[np.nan, 1], [0, np.nan], [1, 2]])

In [12]: df
Out[12]:
    0   1
0 NaN   1
1   0 NaN
2   1   2

In [13]: pd.isnull(df.values)
Out[13]:
array([[ True, False],
       [False,  True],
       [False, False]], dtype=bool)

In [14]: pd.isnull(df.values).any(1)
Out[14]: array([ True,  True, False], dtype=bool)

In [15]: np.nonzero(pd.isnull(df.values).any(1))
Out[15]: (array([0, 1]),)

In [16]: df.index[np.nonzero(pd.isnull(df.values).any(1))]
Out[16]: Int64Index([0, 1], dtype='int64')
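With the default integer index used above, positions and index labels happen to coincide. The sketch below uses a hypothetical string index ("r0", "r1", "r2", not from the original answer) to make the difference visible. It also swaps in `np.flatnonzero`, which returns a flat array of positions directly instead of the one-element tuple that `np.nonzero` gives, and `df.to_numpy()`, the newer spelling of `df.values`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a non-default index so positions and labels differ.
df = pd.DataFrame(
    [[np.nan, 1], [0, np.nan], [1, 2]],
    index=["r0", "r1", "r2"],
)

# Row-wise "has at least one NaN" mask computed on the raw numpy array
row_has_nan = pd.isnull(df.to_numpy()).any(axis=1)

# Integer positions of the rows that contain a NaN
positions = np.flatnonzero(row_has_nan)

# Corresponding index labels
labels = df.index[positions]

print(positions.tolist())
print(labels.tolist())
```

The positions are suitable for `df.iloc[...]`, while the labels are what `df.loc[...]` expects.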

To see some timings, with a slightly larger df:

In [21]: df = pd.DataFrame([[np.nan, 1], [0, np.nan], [1, 2]] * 1000)

In [22]: %timeit np.nonzero(pd.isnull(df.values).any(1))
10000 loops, best of 3: 85.8 µs per loop

In [23]: %timeit df.index[df.isnull().any(axis=1)]
1000 loops, best of 3: 629 µs per loop

And if you care about the index labels (rather than the positions):

In [24]: %timeit df.index[np.nonzero(pd.isnull(df.values).any(1))]
10000 loops, best of 3: 172 µs per loop

Andy Hayden answered Nov 06 '22 00:11