I have a problem where I need to take groups of rows from a DataFrame where the number of rows in a group exceeds a certain number (cutoff). For those groups, I need to take some head rows and the tail row.
I am using the code below:
# Keep only the groups with more than headRows rows (this is the slow step)
train = train[train.groupby('id').id.transform(len) > headRows]
# Take the first headRows rows and the last row of each remaining group
groups = pd.concat([train.groupby('id').head(headRows), train.groupby('id').tail(1)]).sort_index()
This works, but the first line is very slow :( 30 minutes or more.
Is there any way to make the first line faster? If I do not use the first line, the second line produces duplicate indices: for groups with headRows rows or fewer, tail(1) returns a row that head(headRows) already returned, which messes things up.
Thanks in advance. Regards
Note: My train DataFrame has around 70,000 groups of varying size over around 700,000 rows. It actually follows from my other question, as can be seen here: Data processing with adding columns dynamically in Python Pandas Dataframe. Jeff gave a great answer there, but it fails if the group size is less than or equal to the parameter I pass to head(parameter) when concatenating my rows as in Jeff's answer: In [31]: groups = concat.....
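For reference, here is a minimal, self-contained sketch of what I am doing (the data and headRows value below are made up for illustration):

import pandas as pd

# Made-up example: 'id' groups of sizes 4, 2, and 1
train = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 3],
                      'value': [10, 20, 30, 40, 50, 60, 70]})
headRows = 2

# Slow step: keep only the groups with more than headRows rows
train = train[train.groupby('id').id.transform(len) > headRows]

# First headRows rows plus the last row of each remaining group
groups = pd.concat([train.groupby('id').head(headRows),
                    train.groupby('id').tail(1)]).sort_index()
print(groups)  # rows 0, 1, and 3 (only the id == 1 group survives)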
Use groupby/filter:

>>> df.groupby('id').filter(lambda x: len(x) > cutoff)
This will return just the rows of your DataFrame where the size of the group is greater than your cutoff, and it should perform quite a bit better. I timed filter here with a DataFrame of 30,039 'id' groups and a little over 4 million observations:
In [9]: %timeit df.groupby('id').filter(lambda x: len(x) > 12)
1 loops, best of 3: 12.6 s per loop
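Applied to your problem, the whole pipeline would look roughly like the sketch below (the column name 'id' and headRows are taken from your question; your actual train DataFrame is assumed):

import pandas as pd

# Drop the small groups, then take the head rows plus the tail row per group
filtered = train.groupby('id').filter(lambda x: len(x) > headRows)
groups = pd.concat([filtered.groupby('id').head(headRows),
                    filtered.groupby('id').tail(1)]).sort_index()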