I have a datframe with 4 columns of strings and others as integers. Now I need to find out those rows of data where at least one of the column is a non-zero value (or > 0).
manwra,sahAyaH,T7,0,0,0,0,T
manwra, akriti,T5,0,0,1,0,K
awma, prabrtih,B6, 0,1,1,0,S
My output should be
manwra, akriti,T5,0,0,1,0,K
awma, prabrtih,B6, 0,1,1,0,S
I have tried the following to obtain the answer. The string values are in colums 0,1,2 and -1 (last column).
KT[KT.ix[:,3:-2] != 0]
What I am receiving as output is
NaN,NaNNaN,NaN,NaN,NaN,NaN,NaN
NaN,NaN,NaN,NaN,NaN,1,NaN,NaN
NaN,NaN,NaN,NaN,1,1,NaN,NaN
How to obtain the desired output
An alternative solution, which could be useful in certain situations. For example when plotting something and you do not want to create new copies of DataFrames just to filter out some rows and plot it. An added bonus is that the column dtypes in this solution are not limited to numbers, wherever you can do a comparison you can just add it to the eval call.
Guided answer from the start, first to read the data I use the pandas.read_clipboard()
function.
c0 c1 c2 c3 c4 c5 c6 c7
manwra sahAyaH T7 0 0 0 0 T
manwra akriti T5 0 0 1 0 K
awma prabrtih B6 0 1 1 0 S
I added column names c0-c7 here. Just select this table and copy Ctrl+C it to the clipboard.
import pandas as pd
df = pd.read_clipboard()
and then by chaining an .eval()
with a .query()
on the DataFrame
df.eval("all_zero = (c3==0)&(c4==0)&(c5==0)&(c6==0)").query("all_zero==False")
we get the desired results.
c0 | c1 | c2 | c3 | c4 | c5 | c6 | c7 | all_zero | |
---|---|---|---|---|---|---|---|---|---|
1 | manwra | akriti | T5 | 0 | 0 | 1 | 0 | K | False |
2 | awma | prabrtih | B6 | 0 | 1 | 1 | 0 | S | False |
In the eval call we are just doing comparisons on the columns we want to be zero, if they are all zero the &
sign will combine them to True (1) in column all_zero
, then we can query the results and say we only want rows where this new column is False (we could as well say all_zero!=True
, or changed the comparison in the eval call).
If you do not want the all_zero
column just drop that as well
df.eval("all_zero = (c3==0)&(c4==0)&(c5==0)&(c6==0)").query("all_zero==False").drop('all_zero',axis=1)
c0 | c1 | c2 | c3 | c4 | c5 | c6 | c7 | |
---|---|---|---|---|---|---|---|---|
1 | manwra | akriti | T5 | 0 | 0 | 1 | 0 | K |
2 | awma | prabrtih | B6 | 0 | 1 | 1 | 0 | S |
Comparing the speed of the solutions on a bigger DataFrame of same layout we see that my solution is slower (as expected), but it has some added benefits in terms of filtering more broadly. To compare timings I replicated the DataFrame so that it contains 30 000 rows instead
df = df.append([df]*(10000-1),ignore_index=True)
and then ran the different solutions on it and timed it.
I will leave my solution here for reference!
Here is an alternative solution which uses select_dtypes() method:
In [41]: df[(df.select_dtypes(include=['number']) != 0).any(1)]
Out[41]:
0 1 2 3 4 5 6 7
1 manwra akriti T5 0 0 1 0 K
2 awma prabrtih B6 0 1 1 0 S
Explanation:
In [42]: df.select_dtypes(include=['number']) != 0
Out[42]:
3 4 5 6
0 False False False False
1 False False True False
2 False True True False
In [43]: (df.select_dtypes(include=['number']) != 0).any(1)
Out[43]:
0 False
1 True
2 True
dtype: bool
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With