I have more experience with R's data.table, but am trying to learn pandas. In data.table, I can do something like this:
> head(dt_m)
   event_id            device_id longitude latitude               time_ category
1:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
2:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
3:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
4:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
5:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
6:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
                  app_id is_active
1: -5305696816021977482         0
2: -7164737313972860089         0
3: -8504475857937456387         0
4: -8807740666788515175         0
5:  5302560163370202064         0
6:  5521284031585796822         0
dt_m_summary <- dt_m[,
.(
mean_active = mean(is_active, na.rm = TRUE)
, median_lat = median(latitude, na.rm = TRUE)
, median_lon = median(longitude, na.rm = TRUE)
, mean_time = mean(time_)
, new_col = your_function(latitude, longitude, time_)
)
, by = list(device_id, category)
]
The new columns (mean_active through new_col), as well as device_id and category, will appear in dt_m_summary. I could also do a similar by transformation in the original table if I want a new column that holds the results of the groupby-apply:
dt_m[, mean_active := mean(is_active, na.rm = TRUE), by = list(device_id, category)]
(in case I wanted, e.g., to select rows where mean_active is greater than some threshold, or to do something else). I know pandas has groupby, but I haven't found a way of doing this sort of easy transformation. The best I could think of was a series of groupby-applies followed by merging the results into one dataframe, but that seems very clunky. Is there a better way of doing this?
IIUC, use groupby and agg. See the docs for more information.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 2),
                  index=pd.MultiIndex.from_product([list('XY'), range(5)]),
                  columns=list('AB'))
df

df.groupby(level=0).agg(['sum', 'count', 'std'])
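Applied to the question's setup, the data.table summary translates almost line for line: by = list(device_id, category) becomes groupby(['device_id', 'category']), and each summary column becomes a (column, function) pair. A sketch with a made-up frame shaped like dt_m (values are invented; your_function from the question is omitted since it is a placeholder):

```python
import numpy as np
import pandas as pd

# toy frame shaped like the question's dt_m (values are made up)
dt_m = pd.DataFrame({
    'device_id': [1, 1, 2, 2],
    'category':  ['free', 'free', 'paid', 'paid'],
    'latitude':  [10.0, 12.0, 20.0, np.nan],
    'longitude': [30.0, 34.0, 40.0, 42.0],
    'is_active': [0, 1, 1, 1],
})

# by = list(device_id, category)  ->  groupby(['device_id', 'category'])
dt_m_summary = (dt_m
                .groupby(['device_id', 'category'])
                .agg(mean_active=('is_active', 'mean'),
                     median_lat=('latitude', 'median'),
                     median_lon=('longitude', 'median')))
print(dt_m_summary)
```

Note that pandas' mean and median skip NaN by default, mirroring na.rm = TRUE, and device_id/category end up in the result's index (call .reset_index() if you want them back as columns).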
A more tailored example would be:
# level=0 means group by the first level in the index;
# if there is a specific column you want to group by,
# use groupby('specific column name')
df.groupby(level=0).agg({'A': ['sum', 'std'],
                         'B': lambda x: x.sum() ** 2})
Note the dict passed to the agg method has keys 'A' and 'B'. This means: run the functions ['sum', 'std'] on column 'A' and lambda x: x.sum() ** 2 on column 'B'. (Older pandas also accepted a nested dict such as {'B': {'my_function': ...}} to label the lambda's result 'my_function', but that syntax has since been removed.)
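In current pandas, the way to attach a label such as 'my_function' to an aggregation result is named aggregation, where each keyword argument to agg is output_name=(column, function). A sketch reusing the same toy df:

```python
import numpy as np
import pandas as pd

# same toy frame as above
df = pd.DataFrame(np.random.rand(10, 2),
                  index=pd.MultiIndex.from_product([list('XY'), range(5)]),
                  columns=list('AB'))

# named aggregation: output_name=(column, function)
out = df.groupby(level=0).agg(A_sum=('A', 'sum'),
                              A_std=('A', 'std'),
                              my_function=('B', lambda x: x.sum() ** 2))
print(out)
```

This produces a flat set of result columns (A_sum, A_std, my_function) instead of the MultiIndex columns that the list/dict forms create.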
Note 2, pertaining to your new_col: agg requires that the passed functions reduce each column to a scalar, so a function that returns one value per row won't work there. You're better off adding such a column ahead of the groupby/agg.
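For the question's := pattern, where a per-group statistic is written back as a column of the original table, the pandas counterpart is groupby(...).transform, which broadcasts the group result to every row. A sketch with a made-up frame (column names taken from the question):

```python
import pandas as pd

dt_m = pd.DataFrame({
    'device_id': [1, 1, 2, 2],
    'category':  ['free', 'free', 'free', 'paid'],
    'is_active': [0, 1, 1, 1],
})

# dt_m[, mean_active := mean(is_active, na.rm = TRUE), by = list(device_id, category)]
dt_m['mean_active'] = (dt_m
                       .groupby(['device_id', 'category'])['is_active']
                       .transform('mean'))

# rows can now be filtered on the group statistic
active = dt_m[dt_m['mean_active'] > 0.6]
print(active)
```

Unlike agg, transform returns a result aligned to the original index, so the assignment keeps all rows of dt_m.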