I have a DataFrame
like:
Student_id | actvity_timestamp |
---|---|
1001 | 2019-09-05 08:26:12 |
1001 | 2019-09-06 09:26:12 |
1001 | 2019-09-21 10:11:01 |
1001 | 2019-10-24 11:44:01 |
1001 | 2019-10-25 11:31:01 |
1001 | 2019-10-26 12:13:01 |
1002 | 2019-09-11 12:21:01 |
1002 | 2019-09-12 13:11:01 |
1002 | 2019-11-23 16:22:01 |
I want output something like:
Student_id | total_active_days_in_Sept | total_active_days_in_Oct | total_active_days_in_Nov |
---|---|---|---|
1001 | 3 | 3 | 0 |
1002 | 2 | 0 | 1 |
How to achieve this (The months must be taken for the output columns from the actvity_timestamp
)?
You can try doing somthing similar to this:
df = pd.DataFrame.from_dict({
"Student_id": [1001,1001,1001,1001,1001,1001,1002,1002,1002],
"actvity_timestamp": ["2019-09-05 08:26:12", "2019-09-06 09:26:12", "2019-09-21 10:11:01", "2019-10-24 11:44:01", "2019-10-25 11:31:01", "2019-10-26 12:13:01", "2019-09-11 12:21:01", "2019-09-12 13:11:01", "2019-11-23 16:22:01"]
})
months = pd.to_datetime(df.actvity_timestamp).dt.strftime("%B")
result = pd.crosstab(
df.Student_id,
months,
values=df.activity_timestamp.dt.date,
aggfunc=pd.Series.nunique # These last two parameters make it so that if a Student_id has been active more than once in a single day, to count it only once. (Thanks to @tlentali)
).fillna(0)
Series.dt.strftime works on datetime Series, %B
formats the datetime to only show the month's name.
result
will yield
actvity_timestamp November October September
Student_id
1001 0 3 3
1002 1 0 2
You can arrive at the desired layout (with column names sorted in correct month sequence: 'Sep' -> 'Oct' -> 'Nov' rather than 'Nov' -> 'Oct' -> 'Sep') in the following steps:
1) Create a column with month short name. Then use .pivot_table()
to transform the dataframe (with aggregation on the active dates count in each month under each Student_id
):
df['actvity_timestamp'] = pd.to_datetime(df['actvity_timestamp']) # to datetime format
df['activity_month'] = df['actvity_timestamp'].dt.strftime('%b') # get month short name
df['activity_date'] = df['actvity_timestamp'].dt.date # get activity dates
df_out = (df.pivot_table(index='Student_id', # group under each student id
columns='activity_month', # month short name as new columns
values='activity_date', # aggregate on dates
aggfunc='nunique', #activities on the same date counted once
fill_value=0)
.rename_axis(columns=None)
)
Nov Oct Sep
Student_id
1001 0 3 3
1002 1 0 2
2) Sort the column names of month short name back to calendar sequence by .sort_index
with sort key parameter, as follows:
df_out = df_out.sort_index(axis=1, key=lambda x: pd.to_datetime(x, format='%b').month)
Sep Oct Nov
Student_id
1001 3 3 0
1002 2 0 1
3) Further transform to the desired layout by .add_prefix()
:
df_out = df_out.add_prefix('total_active_days_in_').reset_index()
Result:
print(df_out)
Student_id total_active_days_in_Sep total_active_days_in_Oct total_active_days_in_Nov
0 1001 3 3 0
1 1002 2 0 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With