Looking for a simpler way to group by and select rows in pandas

Tags:

python

pandas

df:

id c1 c2 c3
101  a b c
102  b c d
103  d e f
101  h i j
102  k l m

I want to group by the id column and select the rows whose id occurs more than once.

The result should contain every row whose id has more than one entry.

Expected result:

df:

id c1 c2 c3
101  a b c
102  b c d
101  h i j
102  k l m

I can achieve this with the code below:

g = df.groupby('id').size().reset_index(name='counts')
filt = g.query('counts > 1')
m_filt = df.id.isin(filt.id)
df_filtered = df[m_filt]
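
For anyone who wants to run the snippet above, here is a minimal self-contained sketch. The DataFrame construction is an assumption reconstructed from the sample table (the dtypes in the real data may differ); the filtering logic is the same as in the code above.

```python
import pandas as pd

# Sample frame rebuilt from the table in the question (assumed dtypes).
df = pd.DataFrame({
    "id": [101, 102, 103, 101, 102],
    "c1": list("abdhk"),
    "c2": list("bceil"),
    "c3": list("cdfjm"),
})

# Count rows per id, keep the ids seen more than once,
# then select the rows of df whose id is among them.
g = df.groupby("id").size().reset_index(name="counts")
filt = g.query("counts > 1")
df_filtered = df[df["id"].isin(filt["id"])]

print(df_filtered)
```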

Is there a better way of doing this?

asked Sep 01 '19 17:09

Harikrishnan Balachandran

1 Answer

Use GroupBy.transform with 'size' to get a Series of group sizes that has the same length and index as the original DataFrame, so you can filter with boolean indexing:

df[df.groupby('id')['id'].transform('size').gt(1)]
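
As a rough sketch of the transform approach on the sample data: the frame below is rebuilt from the question's table (an assumption about dtypes), and the names sizes and result are only illustrative. The id column is selected on the grouped object before calling transform, so the output is a Series with one group size per row.

```python
import pandas as pd

# Sample frame rebuilt from the table in the question (assumed dtypes).
df = pd.DataFrame({
    "id": [101, 102, 103, 101, 102],
    "c1": list("abdhk"),
    "c2": list("bceil"),
    "c3": list("cdfjm"),
})

# One value per original row: the size of the group that row belongs to.
sizes = df.groupby("id")["id"].transform("size")
print(sizes.tolist())  # [2, 2, 1, 2, 2]

# Keep only the rows whose group has more than one member.
result = df[sizes.gt(1)]
print(result)
```

Because sizes shares its index with df, the boolean mask lines up row by row and no merge or isin lookup is needed.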

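Or, if you need all rows whose id is duplicated, use DataFrame.duplicated with keep=False, which marks every occurrence of a repeated value rather than only the later ones: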

df[df.duplicated('id', keep=False)]

Or, similarly, with Series.duplicated:

df[df['id'].duplicated(keep=False)]
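
A quick check, under the same assumption about the sample frame (rebuilt from the question's table), that the two duplicated calls build the same mask. The variable names are illustrative only.

```python
import pandas as pd

# Sample frame rebuilt from the table in the question (assumed dtypes).
df = pd.DataFrame({
    "id": [101, 102, 103, 101, 102],
    "c1": list("abdhk"),
    "c2": list("bceil"),
    "c3": list("cdfjm"),
})

# keep=False flags every row whose id appears more than once,
# including the first occurrence.
mask_frame = df.duplicated(subset="id", keep=False)
mask_series = df["id"].duplicated(keep=False)

print(mask_series.tolist())  # [True, True, False, True, True]
print(mask_frame.tolist() == mask_series.tolist())  # True

result = df[mask_series]
print(result)
```

With the default keep='first', the first row of each id would be marked False and dropped from the result, so keep=False is what makes this match the group-size filter.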

answered Oct 02 '22 14:10

jezrael
