Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: Keep rows if at least one of them contains certain value

Tags:

python

pandas

I have the following dataframe in Pandas

letter  number
------ -------
a       2
a       0
b       1
b       5
b       2
c       1
c       0
c       2

I'd like to keep all rows if at least one matching number is 0. Result would be:

letter  number
------ -------
a       2
a       0
c       1
c       0
c       2

as b has no matching number being 0

What is the best way to do this ? Thanks !

like image 307
user2475110 Avatar asked Apr 03 '17 11:04

user2475110


People also ask

What is the difference between ILOC () and LOC ()?

When it comes to selecting rows and columns of a pandas DataFrame, loc and iloc are two commonly used functions. Here is the subtle difference between the two functions: loc selects rows and columns with specific labels. iloc selects rows and columns at specific integer positions.

Is pandas query faster than LOC?

The query function seams more efficient than the loc function. DF2: 2K records x 6 columns. The loc function seams much more efficient than the query function.


2 Answers

You need filtration:

df = df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
print (df)
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2

Another solution with transform where get size of 0 rows and filter by boolean indexing:

print (df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()))
0    1
1    1
2    0
3    0
4    0
5    1
6    1
7    1
Name: number, dtype: int64

df = df[df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()) > 0]
print (df)
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2

EDIT:

Faster is not use groupby, better is loc with isin:

df1 = df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
print (df1)
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2

Comparing with another solution:

In [412]: %timeit df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 815 µs per loop

In [413]: %timeit df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
1000 loops, best of 3: 657 µs per loop
like image 86
jezrael Avatar answered Nov 06 '22 07:11

jezrael


You can also do this without the groupby by working out which letters to keep then using isin. I think this is a bit neater personally:

>>> letters_to_keep = df[df['number'] == 0]['letter']
>>> df_reduced = df[df['letter'].isin(letters_to_keep)]
>>> df_reduced
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2

I suspect this would be faster than doing a groupby, that may not be relevant here though! A simple timeit would indicate this is the case:

>>> %%timeit
... df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
100 loops, best of 3: 2.26 ms per loop

>>> %%timeit
... df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 820 µs per loop
like image 30
bastewart Avatar answered Nov 06 '22 09:11

bastewart