Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to calculate the active dates in every month or in a range of month

I have a DataFrame like:

Student_id actvity_timestamp
1001 2019-09-05 08:26:12
1001 2019-09-06 09:26:12
1001 2019-09-21 10:11:01
1001 2019-10-24 11:44:01
1001 2019-10-25 11:31:01
1001 2019-10-26 12:13:01
1002 2019-09-11 12:21:01
1002 2019-09-12 13:11:01
1002 2019-11-23 16:22:01

I want output something like:

Student_id total_active_days_in_Sept total_active_days_in_Oct total_active_days_in_Nov
1001 3 3 0
1002 2 0 1

How to achieve this (The months must be taken for the output columns from the actvity_timestamp)?

like image 462
Ranyk Avatar asked Jul 12 '21 10:07

Ranyk


2 Answers

You can try doing somthing similar to this:

df = pd.DataFrame.from_dict({
    "Student_id": [1001,1001,1001,1001,1001,1001,1002,1002,1002],
    "actvity_timestamp": ["2019-09-05 08:26:12", "2019-09-06 09:26:12", "2019-09-21 10:11:01", "2019-10-24 11:44:01", "2019-10-25 11:31:01", "2019-10-26 12:13:01", "2019-09-11 12:21:01", "2019-09-12 13:11:01", "2019-11-23 16:22:01"]
})

months = pd.to_datetime(df.actvity_timestamp).dt.strftime("%B")

result = pd.crosstab(
    df.Student_id,
    months,
    values=df.activity_timestamp.dt.date,
    aggfunc=pd.Series.nunique # These last two parameters make it so that if a Student_id has been active more than once in a single day, to count it only once. (Thanks to @tlentali)
).fillna(0)

Series.dt.strftime works on datetime Series, %B formats the datetime to only show the month's name.

result will yield

actvity_timestamp  November  October  September
Student_id                                     
1001                      0        3          3
1002                      1        0          2
like image 54
Xelvoz Avatar answered Nov 15 '22 18:11

Xelvoz


You can arrive at the desired layout (with column names sorted in correct month sequence: 'Sep' -> 'Oct' -> 'Nov' rather than 'Nov' -> 'Oct' -> 'Sep') in the following steps:

1) Create a column with month short name. Then use .pivot_table() to transform the dataframe (with aggregation on the active dates count in each month under each Student_id):

df['actvity_timestamp'] = pd.to_datetime(df['actvity_timestamp']) # to datetime format 
df['activity_month'] = df['actvity_timestamp'].dt.strftime('%b')  # get month short name
df['activity_date'] = df['actvity_timestamp'].dt.date     # get activity dates

df_out = (df.pivot_table(index='Student_id',   # group under each student id
                         columns='activity_month',  # month short name as new columns
                         values='activity_date',    # aggregate on dates
                         aggfunc='nunique',    #activities on the same date counted once
                         fill_value=0)
            .rename_axis(columns=None)                          
         )


            Nov  Oct  Sep
Student_id               
1001          0    3    3
1002          1    0    2

2) Sort the column names of month short name back to calendar sequence by .sort_index with sort key parameter, as follows:

df_out = df_out.sort_index(axis=1, key=lambda x: pd.to_datetime(x, format='%b').month)


            Sep  Oct  Nov
Student_id               
1001          3    3    0
1002          2    0    1

3) Further transform to the desired layout by .add_prefix():

df_out = df_out.add_prefix('total_active_days_in_').reset_index()

Result:

print(df_out)

   Student_id  total_active_days_in_Sep  total_active_days_in_Oct  total_active_days_in_Nov
0        1001                         3                         3                         0
1        1002                         2                         0                         1
like image 42
SeaBean Avatar answered Nov 15 '22 18:11

SeaBean