Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find rows with non zero values in a subset of columns in pandas dataframe

I have a datframe with 4 columns of strings and others as integers. Now I need to find out those rows of data where at least one of the column is a non-zero value (or > 0).

manwra,sahAyaH,T7,0,0,0,0,T
manwra, akriti,T5,0,0,1,0,K 
awma, prabrtih,B6, 0,1,1,0,S

My output should be

manwra, akriti,T5,0,0,1,0,K 
awma, prabrtih,B6, 0,1,1,0,S

I have tried the following to obtain the answer. The string values are in colums 0,1,2 and -1 (last column).

KT[KT.ix[:,3:-2] != 0]

What I am receiving as output is

NaN,NaNNaN,NaN,NaN,NaN,NaN,NaN
NaN,NaN,NaN,NaN,NaN,1,NaN,NaN
NaN,NaN,NaN,NaN,1,1,NaN,NaN

How to obtain the desired output

like image 972
Amrith Krishna Avatar asked Aug 28 '16 03:08

Amrith Krishna


2 Answers

Alternative solution

An alternative solution, which could be useful in certain situations. For example when plotting something and you do not want to create new copies of DataFrames just to filter out some rows and plot it. An added bonus is that the column dtypes in this solution are not limited to numbers, wherever you can do a comparison you can just add it to the eval call.

Guided answer from the start, first to read the data I use the pandas.read_clipboard() function.

c0 c1 c2 c3 c4 c5 c6 c7
manwra sahAyaH T7 0 0 0 0 T
manwra akriti T5 0 0 1 0 K 
awma prabrtih B6 0 1 1 0 S

I added column names c0-c7 here. Just select this table and copy Ctrl+C it to the clipboard.

import pandas as pd 
df = pd.read_clipboard()

and then by chaining an .eval() with a .query() on the DataFrame

df.eval("all_zero = (c3==0)&(c4==0)&(c5==0)&(c6==0)").query("all_zero==False")

we get the desired results.

c0 c1 c2 c3 c4 c5 c6 c7 all_zero
1 manwra akriti T5 0 0 1 0 K False
2 awma prabrtih B6 0 1 1 0 S False

In the eval call we are just doing comparisons on the columns we want to be zero, if they are all zero the & sign will combine them to True (1) in column all_zero, then we can query the results and say we only want rows where this new column is False (we could as well say all_zero!=True, or changed the comparison in the eval call).

If you do not want the all_zero column just drop that as well

df.eval("all_zero = (c3==0)&(c4==0)&(c5==0)&(c6==0)").query("all_zero==False").drop('all_zero',axis=1)
c0 c1 c2 c3 c4 c5 c6 c7
1 manwra akriti T5 0 0 1 0 K
2 awma prabrtih B6 0 1 1 0 S

Timings / Speed comparison

Comparing the speed of the solutions on a bigger DataFrame of same layout we see that my solution is slower (as expected), but it has some added benefits in terms of filtering more broadly. To compare timings I replicated the DataFrame so that it contains 30 000 rows instead

df = df.append([df]*(10000-1),ignore_index=True)

and then ran the different solutions on it and timed it.

Quick timing comparison for the different solutions

I will leave my solution here for reference!

like image 108
Magnus Persson Avatar answered Oct 06 '22 20:10

Magnus Persson


Here is an alternative solution which uses select_dtypes() method:

In [41]: df[(df.select_dtypes(include=['number']) != 0).any(1)]
Out[41]:
        0          1   2  3  4  5  6  7
1  manwra     akriti  T5  0  0  1  0  K
2    awma   prabrtih  B6  0  1  1  0  S

Explanation:

In [42]: df.select_dtypes(include=['number']) != 0
Out[42]:
       3      4      5      6
0  False  False  False  False
1  False  False   True  False
2  False   True   True  False

In [43]: (df.select_dtypes(include=['number']) != 0).any(1)
Out[43]:
0    False
1     True
2     True
dtype: bool
like image 22
MaxU - stop WAR against UA Avatar answered Oct 06 '22 20:10

MaxU - stop WAR against UA