Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to count non NaN values accross columns in pandas dataframe?

My data looks like this:

            Close   a   b   c   d   e   Time    
2015-12-03  2051.25 5   4   3   1   1   05:00:00    
2015-12-04  2088.25 5   4   3   1   NaN 06:00:00
2015-12-07  2081.50 5   4   3   NaN NaN 07:00:00
2015-12-08  2058.25 5   4   NaN NaN NaN 08:00:00
2015-12-09  2042.25 5   NaN NaN NaN NaN 09:00:00

I need to count 'horizontally' the values in the columns ['a'] to ['e'] that are not NaN. So the outcome would be this:

df['Count'] = .....
df

            Close   a   b   c   d   e   Time     Count
2015-12-03  2051.25 5   4   3   1   1   05:00:00 5  
2015-12-04  2088.25 5   4   3   1   NaN 06:00:00 4
2015-12-07  2081.50 5   4   3   NaN NaN 07:00:00 3
2015-12-08  2058.25 5   4   NaN NaN NaN 08:00:00 2
2015-12-09  2042.25 5   NaN NaN NaN NaN 09:00:00 1

Thanks

like image 614
hernanavella Avatar asked Dec 15 '22 09:12

hernanavella


1 Answers

You can subselect from your df and call count passing axis=1:

In [24]:
df['count'] = df[list('abcde')].count(axis=1)
df

Out[24]:
              Close  a   b   c   d   e      Time  count
2015-12-03  2051.25  5   4   3   1   1  05:00:00      5
2015-12-04  2088.25  5   4   3   1 NaN  06:00:00      4
2015-12-07  2081.50  5   4   3 NaN NaN  07:00:00      3
2015-12-08  2058.25  5   4 NaN NaN NaN  08:00:00      2
2015-12-09  2042.25  5 NaN NaN NaN NaN  09:00:00      1

TIMINGS

In [25]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)

100 loops, best of 3: 3.28 ms per loop
100 loops, best of 3: 2.76 ms per loop
100 loops, best of 3: 2.98 ms per loop

apply is the slowest which is not a surprise, the drop version is marginally faster but semantically I prefer just passing the list of cols of interest and calling count for readability

Hmm I keep getting varying timings now:

In [27]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)

100 loops, best of 3: 3.33 ms per loop
100 loops, best of 3: 2.7 ms per loop
100 loops, best of 3: 2.7 ms per loop
100 loops, best of 3: 2.57 ms per loop

MORE TIMINGS

In [160]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)
%timeit df[list('abcde')].notnull().sum(axis=1) 

1000 loops, best of 3: 1.4 ms per loop
1000 loops, best of 3: 1.14 ms per loop
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.05 ms per loop

It seems that testing for notnull and summing (as notnull will produce a boolean mask) is quicker on this dataset

On a 50k row df the last method is slightly quicker:

In [172]:
%timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
%timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
%timeit df[list('abcde')].count(axis=1)
%timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)
%timeit df[list('abcde')].notnull().sum(axis=1) 

1 loops, best of 3: 5.83 s per loop
100 loops, best of 3: 6.15 ms per loop
100 loops, best of 3: 6.49 ms per loop
100 loops, best of 3: 6.04 ms per loop
like image 108
EdChum Avatar answered May 23 '23 16:05

EdChum