Is there a good way to find, for each row of a pandas DataFrame, the set of columns that hold non-zero values? Do I have to traverse the DataFrame row by row?
For example, the DataFrame is
c1 c2 c3 c4 c5 c6 c7 c8 c9
1 1 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 2 1 1 1 1 1 0 2
1 5 5 0 0 1 0 4 6
4 3 0 1 1 1 1 5 10
3 5 2 4 1 2 2 1 3
6 4 0 1 0 0 0 0 0
3 9 1 0 1 0 2 1 0
The output is expected to be
['c1','c2']
['c1']
['c2']
...
It seems you have to traverse the DataFrame row by row.
cols = df.columns
bt = df.apply(lambda x: x > 0)                      # boolean mask of non-zero cells
bt.apply(lambda x: list(cols[x.values]), axis=1)    # map each row to its matching column names
and you will get:
0 [c1, c2]
1 [c1]
2 [c2]
3 [c1]
4 [c2]
5 []
6 [c2, c3, c4, c5, c6, c7, c9]
7 [c1, c2, c3, c6, c8, c9]
8 [c1, c2, c4, c5, c6, c7, c8, c9]
9 [c1, c2, c3, c4, c5, c6, c7, c8, c9]
10 [c1, c2, c4]
11 [c1, c2, c3, c5, c7, c8]
dtype: object
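For completeness, here is a minimal self-contained sketch that rebuilds the example DataFrame from the question and produces the same result (the column names and values are copied directly from the example above):
import pandas as pd

# Rebuild the example DataFrame from the question.
data = [
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 2, 1, 1, 1, 1, 1, 0, 2],
    [1, 5, 5, 0, 0, 1, 0, 4, 6],
    [4, 3, 0, 1, 1, 1, 1, 5, 10],
    [3, 5, 2, 4, 1, 2, 2, 1, 3],
    [6, 4, 0, 1, 0, 0, 0, 0, 0],
    [3, 9, 1, 0, 1, 0, 2, 1, 0],
]
df = pd.DataFrame(data, columns=[f'c{i}' for i in range(1, 10)])

cols = df.columns
# Boolean mask of non-zero cells, then map each row to its matching column names.
result = df.apply(lambda x: x > 0).apply(lambda x: list(cols[x.values]), axis=1)
print(result)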
If performance matters, try passing raw=True when building the boolean DataFrame, as below:
%timeit df.apply(lambda x: x > 0, raw=True).apply(lambda x: list(cols[x.values]), axis=1)
1000 loops, best of 3: 812 µs per loop
That gives a noticeable speedup. For comparison, here is the raw=False (the default) result:
%timeit df.apply(lambda x: x > 0).apply(lambda x: list(cols[x.values]), axis=1)
100 loops, best of 3: 2.59 ms per loop
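Not part of the original answer, but if apply is still the bottleneck on a larger frame, a plain list comprehension over the underlying NumPy array is usually faster, since it skips the per-row pandas overhead (actual timings depend on your data size and machine):
import numpy as np

cols = np.asarray(df.columns)
mask = df.values > 0                           # boolean matrix of non-zero cells
result = [cols[row].tolist() for row in mask]  # one list of column names per row
Here result is a plain Python list rather than a Series; wrap it in pd.Series(result, index=df.index) if you need the same output shape as above.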