
How to group a dataframe by 4 time periods and key

I have a dataset that looks something like this:

date                 area_key  total_units  timeatend            starthour  timedifference  vps
2020-01-15 08:22:39  0         9603         2020-01-15 16:32:39  8          29400.0         0.32663265306122446
2020-01-13 08:22:07  0         10273        2020-01-13 16:25:08  8          28981.0         0.35447362064801075
2020-01-23 07:16:55  3         5175         2020-01-23 14:32:44  7          26149.0         0.19790431756472524
2020-01-15 07:00:06  1         838          2020-01-15 07:46:29  7          2783.0          0.3011139058569889
2020-01-15 08:16:01  1         5840         2020-01-15 12:41:16  8          15915.0         0.3669494187873076

That data is then fed into the following function to create k-means clusters.

import time
import sklearn.cluster as clstr

def cluster_Volume(inputData):

    start_tot = time.time()
    # A single-key groupby sum yields a Series (so .unstack() would fail);
    # keep it as a one-column frame instead
    Volume = inputData.groupby(['Startdtm'])['vehiclespersec'].sum().to_frame()

    ## 4 Clusters
    model = clstr.MiniBatchKMeans(n_clusters=4)
    model.fit(Volume.fillna(0))
    Volume['kmeans_4'] = model.predict(Volume.fillna(0))
    end_tot = time.time()
    print("Completed in " + str(end_tot - start_tot))

    ## 8 Clusters
    # Note: Volume now also carries the kmeans_4 column, which leaks into
    # these features; part of why this approach needed rethinking
    start_tot = time.time()
    model = clstr.KMeans(n_clusters=8)
    model.fit(Volume.fillna(0))
    Volume['kmeans_8'] = model.predict(Volume.fillna(0))
    end_tot = time.time()
    print("Completed in " + str(end_tot - start_tot))

    ## Looking at hourly distribution.
    start_tot = time.time()
    Volume_Hourly = Volume.reset_index().set_index(['Startdtm'])
    Volume_Hourly['hour'] = Volume_Hourly.index.hour
    end_tot = time.time()
    print("Completed in " + str(end_tot - start_tot))

    return Volume, Volume_Hourly

What I want to do is to make those clusters relate to both time periods and keys.

With the time periods - 7 am to 10 am, 4 pm to 6 pm, and 12 pm to 2 pm, with 6 pm to 12 am, 12 am to 7 am, 10 am to 12 pm, and 2 pm to 4 pm as the other time periods.

And with the keys - showing how each cluster behaves differently, in a programmatic way.

Desired Result

The desired result will have a table similar to the one below, but feel free to develop it in the best way you can think of. By time period I mean, say, 1 would be before 6 am, 2 - 6 am to 9 am, 3 - 9 to 11, 4 - 11 to 14, etc., but feel free to change it as suits - just my thoughts.
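For example, a rough sketch of that coding (assuming an integer starthour column, as in the sample above; the bin edges are just one reading of the description and are easy to adjust):

import pandas as pd

# Hypothetical period coding: (-1, 5] -> 1 (before 6 am), (5, 8] -> 2 (6-9),
# (8, 10] -> 3 (9-11), (10, 13] -> 4 (11-14), (13, 23] -> 5 (the rest)
df['time_period'] = pd.cut(df['starthour'],
                           bins=[-1, 5, 8, 10, 13, 23],
                           labels=[1, 2, 3, 4, 5]).astype(int)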

I've tried a few approaches to this using groupby, but it doesn't seem to work super well - would love some guidance here.

Amazing response, wow, cheers. It made me realise I was approaching this incorrectly, but it's still super valuable for fixing my approach.

This data is the individual occurrences, as an example.

DateTimeStamp   VS_ID   VS_Summary_Id   Hostname    Vehicle_speed   Lane    Length
11/01/2019 8:22 1   1   place_uno   65  2   71
11/01/2019 8:22 2   1   place_uno   59  1   375
11/01/2019 8:22 3   1   place_uno   59  1   389
11/01/2019 8:22 4   1   place_duo   59  1   832
11/01/2019 8:22 5   1   place_duo   52  1   409

To get volumes I need to aggregate over time into smaller volume blocks (15-second or 15-minute; code posted below).

Then it's essentially the same idea. An additional, and greedy, question would be: how would I incorporate speed into this measurement? I.e., large volumes but low speeds would be good to also cater for.
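A rough sketch of what I mean by the 15-second aggregation (using the column names from the sample above; day-first dates are an assumption about the format):

import pandas as pd

# Volume and mean speed per 15-second block, per hostname
df['DateTimeStamp'] = pd.to_datetime(df['DateTimeStamp'], dayfirst=True)  # assumed day-first
blocks = (df.groupby(['Hostname', pd.Grouper(key='DateTimeStamp', freq='15S')])
            .agg(volume=('VS_ID', 'count'),
                 average_speed=('Vehicle_speed', 'mean')))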

Awesome amazing amazing stuff

With those volume calculations per 15 seconds, I want to do the clustering ON those, as the summary table is too broad-based. I think with what has been linked it should be fine to do that regardless, if I rejig it a bit. I had realised the summary table was too broad, so the time clustering wasn't going to work unless I used this data. So the k-means over time is best with the new data, and the average speed of each cluster is to allow speed to be considered, if that makes sense.
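For example, a rough sketch of that idea (assuming a frame of 15-second intervals with volume and average_speed columns, like the one sketched above):

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_intervals(intervaled_df, n_clusters=4):
    # Scale both features so raw volume does not dominate speed
    features = StandardScaler().fit_transform(
        intervaled_df[['volume', 'average_speed']].fillna(0))
    intervaled_df['cluster'] = KMeans(n_clusters=n_clusters,
                                      random_state=9).fit_predict(features)
    return intervaled_df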

Thanks again for the amazing help. I will be fitting this into the code below - I forgot to link it earlier, and it could help make the answer more specific and valuable.

Thanks guys!

asked Sep 14 '20 by LeCoda



1 Answer

First Data (note: the later parts relate to the updates)

The data is very limited, probably due to the complexity of simplifying it, so I shall make some assumptions and write this as generically as possible, so you can quickly customize it to your needs.

Assumptions:

  1. You want to group the data by hour windows ("hour_code"); the column the data is grouped by is therefore parameterized as group_divide_set_by_column
  2. For each hour window ("hour_code"), you want to cluster by location using the K-means algorithm

Doing so allows you to investigate the clusters of vehicles for each hour window separately, and learn which clustered areas are more active and need attention.

Notes:

  1. The location column (although noted) is missing, yet it is required for the K-means algorithm (I used HostName_key, but it's just a dummy so the code would run; it is not necessarily meaningful).
    Generally speaking, the K-means algorithm is for spaces with a Euclidean distance (mathematically, this means partitioning the observations according to the Voronoi diagram generated by the means).
    Here are some sources for k-means Python examples that are useful to further customize it: 1 2 3 4.
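Since K-means assumes a Euclidean space, a categorical key should really be encoded before clustering; a minimal sketch of that (with real x/y coordinates you would cluster those directly):

import pandas as pd
from sklearn.cluster import KMeans

# One-hot encode the categorical key so K-means sees numeric features
encoded = pd.get_dummies(df['HostName_key'], prefix='host')
labels = KMeans(n_clusters=3, random_state=9).fit_predict(encoded)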

Code:

  1. Let's define a function which, given a dataframe, group-divides it by a given column, group_divide_set_by_column.

This would allow us to group-divide by 'hour_code', and then cluster by location.

def create_clusters_by_group(df, group_divide_set_by_column='hour_code', clusters_number_list=[2, 3]):
    # Divide the set by hours
    divide_df_by_hours(df)
    lst_df_by_groups = {f'{group_divide_set_by_column}_{i}': d for i, (g, d) in enumerate(df.groupby(group_divide_set_by_column))}
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        # Divide to desired amount of clusters
        for clusters_number in clusters_number_list:
            create_cluster(group_df, clusters_number)
        # Setting column types
        set_colum_types(group_df)
    return lst_df_by_groups

  2. The function above uses another function to convert hours to hour codes, similar to how you phrased it:

Time period meaning, say 1 would be before 6 am, 2 - 6 am to 9 am, 3 - 9 to 11, 4 - 11 to 14, etc..

def divide_df_by_hours(df):
    def get_hour_code(h, start_threshold=6, end_threshold=21, windows=3):
        """
        Divide hours to groups:
        Hours:
         1-5  => 1
         6-8  => 2
         9-11 => 3
        12-14 => 4
        15-17 => 5
        18-20 => 6
        21+   => 7
        """
        if      h < start_threshold:
            return 1
        elif    h >= end_threshold:
            return (end_threshold // windows)
        return h // windows
    df['hour_code'] = df['starthour'].apply(lambda h : get_hour_code(h))
  3. Moreover, the function above uses the set_colum_types function, which converts columns to their matching types:

def set_colum_types(df):
    types_dict = {
        'Startdtm': 'datetime64[ns, Australia/Melbourne]',
        'HostName_key': 'category',
        'Totalvehicles': 'int32',
        'Enddtm': 'datetime64[ns, Australia/Melbourne]',
        'starthour': 'int32',
        'timedelta': 'float',
        'vehiclespersec': 'float',
    }
    for col, col_type in types_dict.items():
        df[col] = df[col].astype(col_type)
  4. A dedicated timeit decorator is used to measure the time of each clustering run, reducing boilerplate code.

Whole Code:

import functools

import pandas as pd

from timeit import default_timer as timer

import sklearn
from sklearn.cluster import KMeans

def timeit(func):
    @functools.wraps(func)
    def newfunc(*args, **kwargs):
        startTime = timer()
        result = func(*args, **kwargs)
        elapsedTime = timer() - startTime
        print('function [{}] finished in {} ms'.format(
            func.__name__, int(elapsedTime * 1000)))
        return result  # preserve the wrapped function's return value
    return newfunc

def set_colum_types(df):
    types_dict = {
        'Startdtm': 'datetime64[ns, Australia/Melbourne]',
        'HostName_key': 'category',
        'Totalvehicles': 'int32',
        'Enddtm': 'datetime64[ns, Australia/Melbourne]',
        'starthour': 'int32',
        'timedelta': 'float',
        'vehiclespersec': 'float',
    }
    for col, col_type in types_dict.items():
        df[col] = df[col].astype(col_type)

@timeit
def create_cluster(df, clusters_number):
    # Create K-Means model
    model       = KMeans(n_clusters=clusters_number, max_iter=600, random_state=9)
    # Fetch location
    # NOTE: Should be a *real* location, used another column as dummy
    location_df = df[['HostName_key']]
    kmeans      = model.fit(location_df)
    # Divide to clusters
    df[f'kmeans_{clusters_number}'] = kmeans.labels_

def divide_df_by_hours(df):
    def get_hour_code(h, start_threshold=6, end_threshold=21, windows=3):
        """
        Divide hours to groups:
        Hours:
         1-5  => 1
         6-8  => 2
         9-11 => 3
        12-14 => 4
        15-17 => 5
        18-20 => 6
        21+   => 7
        """
        if      h < start_threshold:
            return 1
        elif    h >= end_threshold:
            return (end_threshold // windows)
        return h // windows
    df['hour_code'] = df['starthour'].apply(lambda h : get_hour_code(h))

def create_clusters_by_group(df, group_divide_set_by_column='hour_code', clusters_number_list=[2, 3]):
    # Divide the set by hours
    divide_df_by_hours(df)
    lst_df_by_groups = {f'{group_divide_set_by_column}_{i}': d for i, (g, d) in enumerate(df.groupby(group_divide_set_by_column))}
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        # Divide to desired amount of clusters
        for clusters_number in clusters_number_list:
            create_cluster(group_df, clusters_number)
        # Setting column types
        set_colum_types(group_df)
    return lst_df_by_groups

# Load data
df = pd.read_csv('data.csv')

# Print data
print(df)

# Create clusters
lst_df_by_groups = create_clusters_by_group(df)

# For each hostname-key dataframe
for group_df_name, group_df in lst_df_by_groups.items():
    print(f'Group {group_df_name} dataframe:')
    print(group_df)

Example output:

              Startdtm  HostName_key  ...  timedelta vehiclespersec
0  2020-01-15 08:22:39             0  ...    29400.0       0.326633
1  2020-01-13 08:22:07             2  ...    28981.0       0.354474
2  2020-01-23 07:16:55             3  ...    26149.0       0.197904
3  2020-01-15 07:00:06             4  ...     2783.0       0.301114
4  2020-01-15 08:16:01             1  ...    15915.0       0.366949
5  2020-01-16 08:22:39             2  ...    29400.0       0.326633
6  2020-01-14 08:22:07             2  ...    28981.0       0.354479
7  2020-01-25 07:16:55             4  ...    26149.0       0.197904
8  2020-01-17 07:00:06             1  ...     2783.0       0.301114
9  2020-01-18 08:16:01             1  ...    15915.0       0.366949

[10 rows x 7 columns]
function [create_cluster] finished in 10 ms
function [create_cluster] finished in 11 ms
function [create_cluster] finished in 10 ms
function [create_cluster] finished in 11 ms
function [create_cluster] finished in 10 ms
function [create_cluster] finished in 11 ms
Group hour_code_0 dataframe:
                   Startdtm HostName_key  ...  kmeans_2 kmeans_3
0 2020-01-15 08:22:39+11:00            0  ...         1        1
1 2020-01-13 08:22:07+11:00            2  ...         0        0
2 2020-01-23 07:16:55+11:00            3  ...         0        2

[3 rows x 10 columns]
Group hour_code_1 dataframe:
                   Startdtm HostName_key  ...  kmeans_2 kmeans_3
3 2020-01-15 07:00:06+11:00            4  ...         1        1
4 2020-01-15 08:16:01+11:00            1  ...         0        0
5 2020-01-16 08:22:39+11:00            2  ...         0        2

[3 rows x 10 columns]
Group hour_code_2 dataframe:
                   Startdtm HostName_key  ...  kmeans_2 kmeans_3
6 2020-01-14 08:22:07+11:00            2  ...         1        2
7 2020-01-25 07:16:55+11:00            4  ...         0        0
8 2020-01-17 07:00:06+11:00            1  ...         1        1
9 2020-01-18 08:16:01+11:00            1  ...         1        1

[4 rows x 10 columns]

Update: Second Data

So, this time we will do things a little differently, as the updated objective is to understand how many vehicles are at each place, and their speed.

Again, things are written generically and with great care, for ease of adaptation.

  1. First, we divide the data set into groups based on their location, which is inferred from Hostname (parameterized for customization as dividing_colum).
def divide_df_by_column(df, dividing_colum='Hostname'):
    df_by_groups = {f'{dividing_colum}_{g}': d for i, (g, d) in enumerate(df.groupby(dividing_colum))}
    return df_by_groups
  2. Now, we arrange the data for each hostname (dividing_colum) group:
def arrange_groups_df(lst_df_by_groups):
    df_by_intervaled_group = dict()
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        df_by_intervaled_group[group_df_name] = arrange_data(group_df)
    return df_by_intervaled_group

2.1. We group by intervals of 15 minutes; after each hostname area's data is divided into time intervals, we aggregate the number of vehicles into a volume column and the average speed into an average_speed column.

def group_by_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    intervaled_df = (df.groupby([pd.Grouper(key=DATE_COLUMN_NAME, freq=INTERVAL_WINDOW)])
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df

def arrange_data(df):
    df = group_by_interval(df)
    return df

The end result of stage #2 is that each hostname's data is divided into time windows of 15 minutes, and we know how many vehicles passed in each window and what their average speed was.

By this, we achieve the objective:

An additional, and greedy question, would be - how would i interpolate speed into this measurement? i.e., large amounts of volumes, but low speeds, would be good to also cater for.

Again, all customizable using [TIME_INTERVAL_COLUMN_NAME, DATE_COLUMN_NAME, INTERVAL_WINDOW].
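To act on that "large volume, low speed" point, here is a minimal sketch that flags such intervals using the columns computed above (the median thresholds are just one possible choice):

def flag_congestion(intervaled_df):
    # Rough indicator: above-median volume combined with below-median speed
    high_volume = intervaled_df['volume'] >= intervaled_df['volume'].median()
    low_speed = intervaled_df['average_speed'] <= intervaled_df['average_speed'].median()
    intervaled_df['congested'] = high_volume & low_speed
    return intervaled_df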

The whole code:

import functools

import numpy
import pandas as pd

TIME_INTERVAL_COLUMN_NAME   = 'time_interval'

DATE_COLUMN_NAME            = 'DateTimeStamp'

INTERVAL_WINDOW             = '15Min'

def round_time(df):
    # Setting the date column to be of datetime type
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Rounding to the interval window
    df[TIME_INTERVAL_COLUMN_NAME] = df[DATE_COLUMN_NAME].dt.round(INTERVAL_WINDOW)

def group_by_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    intervaled_df = (df.groupby([pd.Grouper(key=DATE_COLUMN_NAME, freq=INTERVAL_WINDOW)])
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df

def arrange_data(df):
    df = group_by_interval(df)
    return df

def divide_df_by_column(df, dividing_colum='Hostname'):
    df_by_groups = {f'{dividing_colum}_{g}': d for i, (g, d) in enumerate(df.groupby(dividing_colum))}
    return df_by_groups

def arrange_groups_df(lst_df_by_groups):
    df_by_intervaled_group = dict()
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        df_by_intervaled_group[group_df_name] = arrange_data(group_df)
    return df_by_intervaled_group

# Load data
df = pd.read_csv('data2.csv')

# Print data
print(df)

# Divide by column
df_by_groups = divide_df_by_column(df)

# Arrange data for each group
df_by_intervaled_group = arrange_groups_df(df_by_groups)

# For each hostname-key dataframe
for group_df_name, intervaled_group_df in df_by_intervaled_group.items():
    print(f'Group {group_df_name} dataframe:')
    print(intervaled_group_df)

Example Output:

We can now get valuable results from measuring the volumes (amount of vehicles) and average speed, for each individual hostname area.

     DateTimeStamp  VS_ID  VS_Summary_Id   Hostname  Vehicle_speed  Lane  Length
0  11/01/2019 8:22      1              1  place_uno             65     2      71
1  11/01/2019 8:23      2              1  place_uno             59     1     375
2  11/01/2019 8:25      3              1  place_uno             59     1     389
3  11/01/2019 8:26      4              1  place_duo             59     1     832
4  11/01/2019 8:40      5              1  place_duo             52     1     409
Group Hostname_place_duo dataframe:
                     average_speed  volume
DateTimeStamp                             
2019-11-01 08:15:00             59       1
2019-11-01 08:30:00             52       1
Group Hostname_place_uno dataframe:
                     average_speed  volume
DateTimeStamp                             
2019-11-01 08:15:00             61       3

Appendix

Also created a round_time function, which allows rounding to time intervals without grouping:

def round_time(df):
    # Setting the date column to be of datetime type
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Rounding to the interval window
    df[TIME_INTERVAL_COLUMN_NAME] = df[DATE_COLUMN_NAME].dt.round(INTERVAL_WINDOW)
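A small usage sketch: after round_time, a plain groupby over the rounded column yields the same 15-minute aggregation without pd.Grouper:

round_time(df)
intervaled = df.groupby(TIME_INTERVAL_COLUMN_NAME).agg(
    average_speed=('Vehicle_speed', 'mean'),
    volume=('Hostname', 'count'))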

Third Update

So this time we want to reduce the number of rows in the result.

  1. We change the way we group the data: not only by interval but also by the day of the week. The result allows us to investigate how traffic behaves on each day of the week, in its 15-minute intervals. The group_by_interval function is now changed to group on the concise interval and is therefore called group_by_concised_interval.

We shall call the combination of [day-in-week, hour-minute] a "concise interval"; again, this is configurable with CONCISE_INTERVAL_FORMAT.

def group_by_concised_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Rounding time
    round_time(df)
    # Adding concised interval
    add_consice_interval_columns(df)
    intervaled_df = (df.groupby([TIME_INTERVAL_CONCISE_COLUMN_NAME])
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df

1.1. The group_by_concised_interval function first rounds time to the given 15-minute interval (configurable via INTERVAL_WINDOW) using the round_time method.

1.2. After creating the time intervals for each date, we apply the add_consice_interval_columns function which, given the timestamp rounded to the interval, extracts the concise form.

def add_consice_interval_columns(df):
    # Adding a column for the time interval at day-in-week and hour-minute resolution
    df[TIME_INTERVAL_CONCISE_COLUMN_NAME] = df[TIME_INTERVAL_COLUMN_NAME].apply(lambda x: x.strftime(CONCISE_INTERVAL_FORMAT))

The whole code is:

import functools

import numpy
import pandas as pd

TIME_INTERVAL_COLUMN_NAME           = 'time_interval'

TIME_INTERVAL_CONCISE_COLUMN_NAME   = 'time_interval_concise'

DATE_COLUMN_NAME                    = 'DateTimeStamp'

INTERVAL_WINDOW                     = '15Min'

CONCISE_INTERVAL_FORMAT             = '%A %H:%M'

def round_time(df):
    # Setting the date column to be of datetime type
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Rounding to the interval window
    df[TIME_INTERVAL_COLUMN_NAME] = df[DATE_COLUMN_NAME].dt.round(INTERVAL_WINDOW)

def add_consice_interval_columns(df):
    # Adding a column for the time interval at day-in-week and hour-minute resolution
    df[TIME_INTERVAL_CONCISE_COLUMN_NAME] = df[TIME_INTERVAL_COLUMN_NAME].apply(lambda x: x.strftime(CONCISE_INTERVAL_FORMAT))

def group_by_concised_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Rounding time
    round_time(df)
    # Adding concised interval
    add_consice_interval_columns(df)
    intervaled_df = (df.groupby([TIME_INTERVAL_CONCISE_COLUMN_NAME])
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df

def arrange_data(df):
    df = group_by_concised_interval(df)
    return df

def divide_df_by_column(df, dividing_colum='Hostname'):
    df_by_groups = {f'{dividing_colum}_{g}': d for i, (g, d) in enumerate(df.groupby(dividing_colum))}
    return df_by_groups

def arrange_groups_df(lst_df_by_groups):
    df_by_intervaled_group = dict()
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        df_by_intervaled_group[group_df_name] = arrange_data(group_df)
    return df_by_intervaled_group

# Load data
df = pd.read_csv('data2.csv')

# Print data
print(df)

# Divide by column
df_by_groups = divide_df_by_column(df)

# Arrange data for each group
df_by_intervaled_group = arrange_groups_df(df_by_groups)

# For each hostname-key dataframe
for group_df_name, intervaled_group_df in df_by_intervaled_group.items():
    print(f'Group {group_df_name} dataframe:')
    print(intervaled_group_df)

Output:

Group Hostname_place_duo dataframe:
                       average_speed  volume
time_interval_concise                       
Friday 08:30                      59       1
Friday 08:45                      52       1
Group Hostname_place_uno dataframe:
                       average_speed  volume
time_interval_concise                       
Friday 08:15                      65       1
Friday 08:30                      59       2

So now we can easily figure out how traffic behaves on each day of the week, at all available time intervals.
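As a small extra (assuming the df_by_intervaled_group dict built above), the per-hostname frames can also be stacked for a side-by-side view:

# Dict keys become the outer index level
combined = pd.concat(df_by_intervaled_group, names=['hostname_group'])
print(combined.sort_values('volume', ascending=False).head())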

answered Oct 26 '22 by Aviv Yaniv