Simultaneous operation of groupby and resample on pandas dataframe?

Question

My pandas dataframe consists of a categorical column JOB_TITLE, a numeric column BASE_SALARY and a datetime index JOIN_DATE. I'd like to perform an aggregation over the groups of the categorical and downsampled datetimes as follows:

# Resampled at frequency of start data of every 5 years
mean_agg = (df
          .groupby('JOB_TITLE')
          .resample('5AS')['BASE_SALARY']
          .mean())

Unfortunately, as the groupby operation is occurring before the resample, the resample operation is performed independently for each JOB_TITLE group. This results in following Series:

| JOB_TITLE         | JOIN_DATE  |       |
|-------------------|------------|-------|
| Data Scientist    | 2004-01-01 | 60000 |
|                   | 2009-01-01 | 75000 |
|                   | 2014-01-01 | 90000 |
|                   |            |       |
| Software Engineer | 2001-01-01 | 70000 |
|                   | 2006-01-01 | 85000 |
|                   | 2011-01-01 | 90000 |
|                   | 2016-01-01 | 85000 |

As you can see the indexes at JOIN_DATE level for Data Scientist group and Software Engineer are not aligned. This creates a problem when you apply unstack for level JOB_TITLE as follows:

mean_agg.unstack('JOB_TITLE')

This results in the following dataframe:

| JOB_TITLE  | Data Scientist | Software Engineer |
|------------|----------------|-------------------|
| JOIN_DATE  |                |                   |
| 2001-01-01 | NaN            | 70000             |
| 2004-01-01 | 60000          | NaN               |
| 2006-01-01 | NaN            | 85000             |
| 2009-01-01 | 75000          | NaN               |
| 2011-01-01 | NaN            | 70000             |
| 2014-01-01 | 90000          | NaN               |
| 2016-01-01 | NaN            | 85000             |

How can I avoid this sequential operation of groupby and resample and instead perform a simultaneous operation? Thanks!

Scott Boston · Accepted Answer

Update Pandas 0.21 answer: pd.TimeGrouper is getting deprecated, use pd.Grouper instead.

mean_agg = (df.groupby(['JOB_TITLE',pd.Grouper(freq='5AS')])['BASE_SALARY']
              .mean())

mean_agg.unstack('JOB_TITLE')

Instead of using resample, let's try to use pd.TimeGrouper

mean_agg = (df
      .groupby(['JOB_TITLE',pd.TimeGrouper(freq='5AS')])['BASE_SALARY']
      .mean())

mean_agg.unstack('JOB_TITLE')

TimeGrouper aligns the bins of the grouped time range.

Simultaneous operation of groupby and resample on pandas dataframe?

Tags:

python

pandas

dataframe

group-by

time-series

S. Naribole

1 Answers

Scott Boston

Recent Activity

Donate For Us

Simultaneous operation of groupby and resample on pandas dataframe?

Tags:

python

pandas

dataframe

group-by

time-series

S. Naribole

1 Answers

Scott Boston

Related questions

Recent Activity

Donate For Us