Pandas: Keep all rows of a group if at least one of them contains a certain value

Tags:

python

pandas

I have the following DataFrame in pandas:

letter  number
------ -------
a       2
a       0
b       1
b       5
b       2
c       1
c       0
c       2

I'd like to keep all rows for a letter if at least one of that letter's numbers is 0. The result would be:

letter  number
------ -------
a       2
a       0
c       1
c       0
c       2

The b rows are dropped because none of them has a number equal to 0.

What is the best way to do this? Thanks!

asked Apr 03 '17 11:04

user2475110

2 Answers

You need to filter by group: groupby with filter keeps every row of a group for which the function returns True.

df = df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
print (df)
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2
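For anyone who wants to reproduce this, here is a minimal self-contained sketch. The DataFrame is rebuilt from the table in the question, and the name kept is my own choice so the original df is not overwritten:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "letter": list("aabbbccc"),
    "number": [2, 0, 1, 5, 2, 1, 0, 2],
})

# filter() calls the lambda once per group and keeps the whole group
# when it returns True, i.e. when any number in the group equals 0.
kept = df.groupby("letter").filter(lambda g: (g["number"] == 0).any())
print(kept)
```

The original row labels are preserved, so the b rows (labels 2 to 4) simply disappear from the index.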

Another solution uses transform to count the rows equal to 0 in each group, then filters with boolean indexing:

print (df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()))
0    1
1    1
2    0
3    0
4    0
5    1
6    1
7    1
Name: number, dtype: int64

df = df[df.groupby('letter')['number'].transform(lambda x: (x == 0).sum()) > 0]
print (df)
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2
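A possible variation on the transform idea, not part of the code above: build the boolean column first and let transform('any') broadcast the per-group result, which avoids the Python lambda. The names has_zero and kept are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "letter": list("aabbbccc"),
    "number": [2, 0, 1, 5, 2, 1, 0, 2],
})

# True for every row whose group contains at least one 0.
has_zero = df["number"].eq(0).groupby(df["letter"]).transform("any")

# Boolean indexing keeps only the rows flagged True.
kept = df[has_zero]
print(kept)
```

Because has_zero is aligned with df's index, it can be used directly as a mask.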

EDIT:

It is faster to avoid groupby and use loc with isin instead:

df1 = df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
print (df1)
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2
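If you need this in more than one place, the isin approach can be wrapped in a small helper. This is only a sketch: the function name keep_groups_with_value and its parameters are hypothetical, not a pandas API.

```python
import pandas as pd


def keep_groups_with_value(frame, group_col, value_col, value):
    """Keep all rows whose group has at least one row where value_col == value."""
    # Group keys that have at least one matching row.
    keys = frame.loc[frame[value_col] == value, group_col]
    # Keep every row belonging to one of those groups.
    return frame[frame[group_col].isin(keys)]


# Sample data copied from the question.
df = pd.DataFrame({
    "letter": list("aabbbccc"),
    "number": [2, 0, 1, 5, 2, 1, 0, 2],
})

df1 = keep_groups_with_value(df, "letter", "number", 0)
print(df1)
```

If no row matches the value, keys is empty and the result is an empty DataFrame with the same columns.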

Timing comparison with the isin solution from the other answer:

In [412]: %timeit df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 815 µs per loop

In [413]: %timeit df[df['letter'].isin(df.loc[df['number'] == 0, 'letter'])]
1000 loops, best of 3: 657 µs per loop

answered Nov 06 '22 07:11

jezrael

You can also do this without the groupby by working out which letters to keep, then using isin. I think this is a bit neater personally:

>>> letters_to_keep = df[df['number'] == 0]['letter']
>>> df_reduced = df[df['letter'].isin(letters_to_keep)]
>>> df_reduced
  letter  number
0      a       2
1      a       0
5      c       1
6      c       0
7      c       2

I suspect this is faster than doing a groupby, though that may not matter here. A simple timeit suggests this is the case:

>>> %%timeit
... df.groupby('letter').filter(lambda x: (x['number'] == 0).any())
100 loops, best of 3: 2.26 ms per loop

>>> %%timeit
... df[df['letter'].isin(df[df['number'] == 0]['letter'])]
1000 loops, best of 3: 820 µs per loop
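The eight-row frame is tiny, so the numbers above mostly measure fixed overhead. To check on something larger outside IPython, a rough sketch with the standard timeit module could look like this. The synthetic data, its sizes and the function names are my own assumptions, and the group keys are integers rather than letters; timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (assumed sizes): 20,000 rows spread over 200 groups.
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "letter": rng.integers(0, 200, size=20_000),
    "number": rng.integers(0, 500, size=20_000),
})


def with_groupby():
    return big.groupby("letter").filter(lambda g: (g["number"] == 0).any())


def with_isin():
    return big[big["letter"].isin(big.loc[big["number"] == 0, "letter"])]


# Both approaches should select exactly the same rows.
assert with_groupby().equals(with_isin())

for fn in (with_groupby, with_isin):
    seconds = timeit.timeit(fn, number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```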

answered Nov 06 '22 09:11

bastewart
