I have the following dataframe in Pandas
letter number
------ -------
a 2
a 0
b 1
b 5
b 2
c 1
c 0
c 2
I'd like to keep all rows if at least one matching number is 0. Result would be:
letter number
------ -------
a 2
a 0
c 1
c 0
c 2
as b has no matching number being 0
What is the best way to do this ? Thanks !
When it comes to selecting rows and columns of a pandas DataFrame, loc and iloc are two commonly used functions. Here is the subtle difference between the two functions: loc selects rows and columns with specific labels. iloc selects rows and columns at specific integer positions.
The query function seams more efficient than the loc function. DF2: 2K records x 6 columns. The loc function seams much more efficient than the query function.
You need filtration:
df = df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
print (df)
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
Another solution with transform
where get size of 0
rows and filter by boolean indexing
:
print (df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()))
0 1
1 1
2 0
3 0
4 0
5 1
6 1
7 1
Name: number, dtype: int64
df = df[df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()) > 0]
print (df)
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
EDIT:
Faster is not use groupby
, better is loc
with isin
:
df1 = df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
print (df1)
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
Comparing with another solution:
In [412]: %timeit df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 815 µs per loop
In [413]: %timeit df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
1000 loops, best of 3: 657 µs per loop
You can also do this without the groupby
by working out which letters to keep then using isin
. I think this is a bit neater personally:
>>> letters_to_keep = df[df['number'] == 0]['letter']
>>> df_reduced = df[df['letter'].isin(letters_to_keep)]
>>> df_reduced
letter number
0 a 2
1 a 0
5 c 1
6 c 0
7 c 2
I suspect this would be faster than doing a groupby
, that may not be relevant here though! A simple timeit
would indicate this is the case:
>>> %%timeit
... df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
100 loops, best of 3: 2.26 ms per loop
>>> %%timeit
... df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 820 µs per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With