I have a large data set with the following structure
User X
1 0
1 0
2 0
2 0
2 1
3 0
3 0
I would like to take a subset of the data such that the sum of column X for each User is 0. Given the example above, the subset should include only the observations for users 1 and 3, as follows:
User X
1 0
1 0
3 0
3 0
Is there a way to do this using the groupby function without grouping the data? I want the subset to include the individual observations.
As an alternative to @unutbu's answer, there's also
>>> df.loc[df.groupby("User")["X"].transform(sum) == 0]
User X
0 1 0
1 1 0
5 3 0
6 3 0
This creates a df-length boolean Series to use as a selector:
>>> df.groupby("User")["X"].transform(sum) == 0
0 True
1 True
2 False
3 False
4 False
5 True
6 True
dtype: bool
transform is used when you want to "broadcast" the result of a groupby reduction operation back up to all the elements of each group. It comes in handy.
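To make the reduce-vs-broadcast distinction concrete, here is a minimal sketch using the question's data: an aggregation like sum returns one value per group, while transform repeats each group's value for every row of that group, so the result aligns with the original DataFrame.

```python
import pandas as pd

# The example data from the question
df = pd.DataFrame({"User": [1, 1, 2, 2, 2, 3, 3],
                   "X":    [0, 0, 0, 0, 1, 0, 0]})

# A plain reduction yields one value per group (one entry per User)
sums = df.groupby("User")["X"].sum()
print(sums.tolist())       # [0, 1, 0]

# transform broadcasts each group's sum back to every row of that group,
# so the result has the same length (and index) as df
broadcast = df.groupby("User")["X"].transform("sum")
print(broadcast.tolist())  # [0, 0, 1, 1, 1, 0, 0]
```

Because the transformed Series shares df's index, comparing it to 0 produces a boolean mask that can be passed straight to df.loc.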
DSM's answer, which selects rows using a boolean mask, works well even if the DataFrame has a non-unique index. My method, which selects rows using index values, is slightly slower when the index is unique and significantly slower when the index contains duplicate values.
@roland: Please consider accepting DSM's answer instead.
You could use a groupby-filter:
In [16]: df.loc[df.groupby('User')['X'].filter(lambda x: x.sum() == 0).index]
Out[16]:
User X
0 1 0
1 1 0
5 3 0
6 3 0
By itself, the groupby-filter just returns this:
In [29]: df.groupby('User')['X'].filter(lambda x: x.sum() == 0)
Out[29]:
0 0
1 0
5 0
6 0
Name: X, dtype: int64
but you can then use its index,
In [30]: df.groupby('User')['X'].filter(lambda x: x.sum() == 0).index
Out[30]: Int64Index([0, 1, 5, 6], dtype='int64')
to select the desired rows using df.loc.
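As an aside, if you group the whole DataFrame instead of just the X column, filter returns the full rows directly, so the .index / df.loc round-trip is unnecessary. A minimal sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({"User": [1, 1, 2, 2, 2, 3, 3],
                   "X":    [0, 0, 0, 0, 1, 0, 0]})

# Filtering the DataFrame groupby (not the column groupby) keeps
# all columns of the rows belonging to groups that pass the test
result = df.groupby("User").filter(lambda g: g["X"].sum() == 0)
print(result)
#    User  X
# 0     1  0
# 1     1  0
# 5     3  0
# 6     3  0
```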
Here is the benchmark I used:
In [49]: df2 = pd.concat([df]*10000) # df2 has a non-unique index
I Ctrl-C'd this one because it was taking too long to finish:
In [50]: %timeit df2.loc[df2.groupby('User')['X'].filter(lambda x: x.sum() == 0).index]
When I realized my mistake, I made a DataFrame with a unique index:
In [51]: df3 = df2.reset_index() # this gives df3 a unique index
In [52]: %timeit df3.loc[df3.groupby('User')['X'].filter(lambda x: x.sum() == 0).index]
100 loops, best of 3: 13 ms per loop
In [53]: %timeit df3.loc[df3.groupby("User")["X"].transform(sum) == 0]
100 loops, best of 3: 11.4 ms per loop
This shows DSM's method performs well even with a non-unique index:
In [54]: %timeit df2.loc[df2.groupby("User")["X"].transform(sum) == 0]
100 loops, best of 3: 11.2 ms per loop
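The timings above can be reproduced with a sketch like the following (absolute numbers will differ by machine and pandas version; the point is that the mask method needs no index lookup, while the filter-plus-.loc method degrades badly on a non-unique index):

```python
import timeit
import pandas as pd

df = pd.DataFrame({"User": [1, 1, 2, 2, 2, 3, 3],
                   "X":    [0, 0, 0, 0, 1, 0, 0]})
df2 = pd.concat([df] * 10000)        # non-unique index (0..6 repeated)
df3 = df2.reset_index(drop=True)     # unique index

def mask_method(frame):
    # DSM's approach: boolean mask, no index lookup
    return frame.loc[frame.groupby("User")["X"].transform("sum") == 0]

def filter_index_method(frame):
    # unutbu's approach: filter, then select by index values
    return frame.loc[frame.groupby("User")["X"].filter(lambda x: x.sum() == 0).index]

print(timeit.timeit(lambda: mask_method(df2), number=10))
print(timeit.timeit(lambda: filter_index_method(df3), number=10))
```

Both methods select the same 40,000 rows here; only their cost differs.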