Pandas groupby then drop groups below specified size

Tags:

python

pandas

I'm trying to separate a DataFrame into groups and drop groups below a minimum size (small outliers).

Here's what I've tried:

df.groupby(['A']).filter(lambda x: x.count() > min_size)
df.groupby(['A']).filter(lambda x: x.size() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].count() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].size() > min_size)

But these either throw an exception or return a different table than I'm expecting. I'd just like to filter, not compute a new table.

934

asked Feb 08 '19 00:02

Caleb Jares

2 Answers

You can use len, which returns the number of rows in each group:

In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

In [12]: df.groupby('A').filter(lambda x: len(x) > 1)
Out[12]:
   A  B
0  1  2
1  1  4
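As a rough illustration of why the attempts in the question behave differently from `len`, the sketch below inspects one group by hand. The `min_size` value and the hand-built `group` are assumptions for illustration only; `group` is a stand-in for the sub-DataFrame that the lambda is called with.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
min_size = 2  # hypothetical threshold, not from the original question

# Stand-in for the sub-DataFrame passed to the lambda for the group A == 1.
group = df[df['A'] == 1]

# DataFrame.count() returns a Series (non-null count per column), so
# comparing it to min_size yields a Series, not the single bool filter needs.
per_column_counts = group.count()

# .size is an attribute (rows * columns), so writing .size() fails
# because an integer is not callable.
n_elements = group.size

# len() gives one integer: the number of rows in the group.
n_rows = len(group)

# A scalar comparison is what filter expects from the lambda.
kept = df.groupby('A').filter(lambda x: len(x) >= min_size)
print(kept)
```

Because the lambda returns a single boolean per group, `filter` keeps or drops each group's rows as a whole and leaves the original index and columns unchanged.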

130

answered Nov 08 '22 20:11

Andy Hayden

The number of rows is in the attribute .shape[0]:

df.groupby('A').filter(lambda x: x.shape[0] >= min_size)

NB: To remove the groups below the minimum size, keep the groups that are at or above it (>=, not >).
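To make the `>=` versus `>` point concrete, here is a minimal sketch on made-up data (the frame and the `min_size` value are assumptions for illustration). It also shows one possible alternative that was not part of this answer: building a boolean mask with `transform('size')` instead of calling a Python lambda per group.

```python
import pandas as pd

# Hypothetical data: group sizes are 2 (A == 1), 1 (A == 5) and 3 (A == 7).
df = pd.DataFrame({'A': [1, 1, 5, 7, 7, 7],
                   'B': [2, 4, 6, 8, 10, 12]})
min_size = 2  # assumed threshold

# Keeps groups with at least min_size rows (drops only A == 5).
via_filter = df.groupby('A').filter(lambda x: x.shape[0] >= min_size)

# Alternative sketch: broadcast each group's size back onto its rows,
# then select rows with an ordinary boolean mask.
group_sizes = df.groupby('A')['A'].transform('size')
via_mask = df[group_sizes >= min_size]

# With a strict >, the group whose size equals min_size is dropped too.
strictly_greater = df.groupby('A').filter(lambda x: x.shape[0] > min_size)

print(via_filter)
print(strictly_greater)
```

On this data both the `filter` call and the mask select the same rows; the strict comparison additionally drops the group with exactly `min_size` rows.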

answered Nov 08 '22 20:11

DYZ

