Python pandas - select rows based on groupby

Question

I have a sample table like this:

Dataframe: df

Col1     Col2    Col3    Col4
A   1   10  i
A   1   11  k
A   1   12  a
A   2   10  w
A   2   11  e
B   1   15  s
B   1   16  d
B   2   21  w
B   2   25  e
B   2   36  q
C   1   23  a
C   1   24  b

I'm trying to get all records/rows of the groups (Col1, Col2) that has the smaller number of records AND skipping over those groups that have only 1 record (in this example Col1 = 'C'). So, the output would be as follows:

A   2   10  w
A   2   11  e
B   1   15  s
B   1   16  d

since group (A,2) has 2 records compared to group (A,1) which has 3 records.

I tried to approach this issue from different angles but just can't seem to get the result that I need. I am able to find the groups that I need using a combination of groupby, filter and agg but how do I now use this as a select filter on df? After spending a lot of time on this, I wasn't even sure that the approach was correct as it looked overly complicated. I am sure that there is an elegant solution but I just can't see it. Any advise on how to approach this would be greatly appreciated.

I had this to get the groups for which I wanted the rows displayed:

    groups = df.groupby(["Col1, Col2"])["Col2"].agg({'no':'count'})
filteredGroups = groups.groupby(level=0).filter(lambda group: group.size > 1)
    print filteredGroups.groupby(level=0).agg('idxmin')

The second line was to account for groups that may have only one record as those I don't want to consider. Honestly, I tried so many variations and approaches that eventually did not give me the result that I wanted. I see that all answers are not one-liners so that at least I don't feel like I was over thinking the problem.

JohnE · Accepted Answer

df['sz'] = df.groupby(['Col1','Col2'])['Col3'].transform("size")

df['rnk']     = df.groupby('Col1')['sz'].rank(method='min')
df['rnk_rev'] = df.groupby('Col1')['sz'].rank(method='min',ascending=False)

df.loc[ (df['rnk'] == 1.0) & (df['rnk_rev'] != 1.0) ]

      Col1  Col2  Col3 Col4  sz  rnk  rnk_rev
3    A     2    10    w   2  1.0      4.0
4    A     2    11    e   2  1.0      4.0
5    B     1    15    s   2  1.0      4.0
6    B     1    16    d   2  1.0      4.0

Edit: changed "count" to "size" (as in @Marco Spinaci's answer) which doesn't matter in this example but might if there were missing values.

And for clarity, here's what the df looks like before dropping the selected rows.

   Col1  Col2  Col3 Col4  sz  rnk  rnk_rev
0     A     1    10    i   3  3.0      1.0
1     A     1    11    k   3  3.0      1.0
2     A     1    12    a   3  3.0      1.0
3     A     2    10    w   2  1.0      4.0
4     A     2    11    e   2  1.0      4.0
5     B     1    15    s   2  1.0      4.0
6     B     1    16    d   2  1.0      4.0
7     B     2    21    w   3  3.0      1.0
8     B     2    25    e   3  3.0      1.0
9     B     2    36    q   3  3.0      1.0
10    C     1    23    a   2  1.0      1.0
11    C     1    24    b   2  1.0      1.0

Python pandas - select rows based on groupby

Tags:

python

pandas

group-by

Ant Smith

1 Answers

JohnE

Recent Activity

Donate For Us

Python pandas - select rows based on groupby

Tags:

python

pandas

group-by

Ant Smith

1 Answers

JohnE

Related questions

Recent Activity

Donate For Us