
Looking for simpler solution to group by and select rows in pandas

Tags: python, pandas

df:

id   c1  c2  c3
101  a   b   c
102  b   c   d
103  d   e   f
101  h   i   j
102  k   l   m

I want to group by the id column and select the rows whose group has a count > 1.

The result should be all rows whose id appears more than once.

Expected result:

df:

id   c1  c2  c3
101  a   b   c
102  b   c   d
101  h   i   j
102  k   l   m

I am able to achieve this with the code below.

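# count the rows per id
g = df.groupby('id').size().reset_index(name='counts')
# keep only the ids that occur more than once
filt = g.query('counts > 1')
# select the original rows whose id is among them
m_filt = df.id.isin(filt.id)
df_filtered = df[m_filt]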

I wanted to check whether there is a better way of doing this.

Harikrishnan Balachandran asked Sep 01 '19 17:09


1 Answer

Use GroupBy.transform with 'size' to get a Series of group sizes with the same length as the original DataFrame, so you can filter by boolean indexing:

df[df.groupby('id')['id'].transform('size').gt(1)]
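
To make the intermediate step visible, here is a minimal sketch of the transform approach. It assumes the sample data from the question, rebuilt by hand; the names sizes and result are only illustrative.

```python
import pandas as pd

# Sample frame from the question, rebuilt by hand (assumed dtypes)
df = pd.DataFrame({
    "id": [101, 102, 103, 101, 102],
    "c1": list("abdhk"),
    "c2": list("bceil"),
    "c3": list("cdfjm"),
})

# Size of each row's group, aligned with the original index
sizes = df.groupby("id")["id"].transform("size")

# Keep only the rows whose id occurs more than once
result = df[sizes.gt(1)]
print(result)
```

Selecting the 'id' column before calling transform matters: the grouping column is not part of the transformed frame, so indexing it afterwards would fail.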

Or, if you need all duplicated rows, use DataFrame.duplicated with keep=False:

df[df.duplicated('id', keep=False)]

Or, similarly, with Series.duplicated:

df[df['id'].duplicated(keep=False)]
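
As a rough check, the sketch below applies the duplicated mask to the sample data from the question (rebuilt by hand) and compares it with the counting approach from the question. The variable names are only illustrative.

```python
import pandas as pd

# Sample frame from the question, rebuilt by hand (assumed dtypes)
df = pd.DataFrame({
    "id": [101, 102, 103, 101, 102],
    "c1": list("abdhk"),
    "c2": list("bceil"),
    "c3": list("cdfjm"),
})

# keep=False marks every occurrence of a repeated id, not just the later ones
mask = df["id"].duplicated(keep=False)
by_duplicated = df[mask]

# The question's original approach, for comparison
counts = df.groupby("id").size().reset_index(name="counts")
by_counts = df[df["id"].isin(counts.query("counts > 1")["id"])]

print(by_duplicated.equals(by_counts))
```

With the default keep='first', the first row of each repeated id would not be marked, so it would be dropped from the selection; keep=False is what keeps all of them.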

jezrael answered Oct 02 '22 14:10