Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas: how to do multiple groupby-apply operations

I have more experience with R’s data.table, but am trying to learn pandas. In data.table, I can do something like this:

> head(dt_m)
   event_id           device_id longitude latitude               time_ category
1:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
2:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
3:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
4:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
5:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
6:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
                 app_id is_active
1: -5305696816021977482         0
2: -7164737313972860089         0
3: -8504475857937456387         0
4: -8807740666788515175         0
5:  5302560163370202064         0
6:  5521284031585796822         0


dt_m_summary <- dt_m[,
                     .(
                       mean_active = mean(is_active, na.rm = TRUE)
                       , median_lat = median(latitude, na.rm = TRUE)
                       , median_lon = median(longitude, na.rm = TRUE)
                       , mean_time = mean(time_)
                       , new_col = your_function(latitude, longitude, time_)
                     )
                     , by = list(device_id, category)
                     ]

The new columns (mean_active through new_col), as well as device_id and category, will appear in dt_m_summary. I could also do a similar by transformation in the original table if I want a new column that has the results of the groupby-apply:

dt_m[, mean_active := mean(is_active, na.rm = TRUE), by = list(device_id, category)]

(in case I wanted, e.g., to select rows where mean_active is greater than some threshold, or do something else).

I know there is groupby in pandas, but I haven’t found a way of doing the sort of easy transformations as above. The best I could think of was doing a series of groupby-apply’s and then merging the results into one dataframe, but that seems very clunky. Is there a better way of doing that?

like image 428
Anarcho-Chossid Avatar asked Aug 26 '16 06:08

Anarcho-Chossid


1 Answers

IIUC, use groupby and agg. See docs for more information.

df = pd.DataFrame(np.random.rand(10, 2),
                  pd.MultiIndex.from_product([list('XY'), range(5)]),
                  list('AB'))

df

enter image description here

df.groupby(level=0).agg(['sum', 'count', 'std'])

enter image description here


A more tailored example would be

# level=0 means group by the first level in the index
# if there is a specific column you want to group by
# use groupby('specific column name')
df.groupby(level=0).agg({'A': ['sum', 'std'],
                         'B': {'my_function': lambda x: x.sum() ** 2}})

enter image description here

Note the dict passed to the agg method has keys 'A' and 'B'. This means, run the functions ['sum', 'std'] for 'A' and lambda x: x.sum() ** 2 for 'B' (and label it 'my_function')

Note 2 pertaining to your new_column. agg requires that the passed functions reduce columns to scalars. You're better off adding the new column ahead of the groupby/agg

like image 56
piRSquared Avatar answered Oct 17 '22 21:10

piRSquared