
How to group a dataframe by 4 time periods and key

I have a dataset that looks something like this:

date                 area_key  total_units  timeatend            starthour  timedifference  vps
2020-01-15 08:22:39  0         9603         2020-01-15 16:32:39  8          29400.0         0.32663265306122446
2020-01-13 08:22:07  0         10273        2020-01-13 16:25:08  8          28981.0         0.35447362064801075
2020-01-23 07:16:55  3         5175         2020-01-23 14:32:44  7          26149.0         0.19790431756472524
2020-01-15 07:00:06  1         838          2020-01-15 07:46:29  7          2783.0          0.3011139058569889
2020-01-15 08:16:01  1         5840         2020-01-15 12:41:16  8          15915.0         0.3669494187873076

That data is then fed into the following function to create k-means clusters.

import time
import sklearn.cluster as clstr

def cluster_Volume(inputData):

    start_tot = time.time()
    # A single-key groupby sum yields a Series (so .unstack() would fail);
    # keep it as a one-column frame instead
    Volume = inputData.groupby(['Startdtm'])['vehiclespersec'].sum().to_frame()

    ## 4 Clusters
    model = clstr.MiniBatchKMeans(n_clusters=4)
    model.fit(Volume.fillna(0))
    Volume['kmeans_4'] = model.predict(Volume.fillna(0))
    end_tot = time.time()
    print("Completed in " + str(end_tot - start_tot))

    ## 8 Clusters
    # Note: Volume now also carries the kmeans_4 column, which leaks into
    # these features; part of why this approach needed rethinking
    start_tot = time.time()
    model = clstr.KMeans(n_clusters=8)
    model.fit(Volume.fillna(0))
    Volume['kmeans_8'] = model.predict(Volume.fillna(0))
    end_tot = time.time()
    print("Completed in " + str(end_tot - start_tot))

    ## Looking at hourly distribution.
    start_tot = time.time()
    Volume_Hourly = Volume.reset_index().set_index(['Startdtm'])
    Volume_Hourly['hour'] = Volume_Hourly.index.hour
    end_tot = time.time()
    print("Completed in " + str(end_tot - start_tot))

    return Volume, Volume_Hourly

What I want to do is to make those clusters relate to both time periods and keys.

With the time periods - 7 am to 10 am, 4 pm to 6 pm, and 12 pm to 2 pm, with 6 pm to 12 am, 12 am to 7 am, 10 am to 12 pm, and 2 pm to 4 pm as the other time periods.

And with the keys - showing how each cluster behaves differently, in a programmatic way.

Desired Result

The desired result will have a table similar to the one below, but feel free to develop it in the best way you can think of. By time period I mean, say, 1 would be before 6 am, 2 - 6 am to 9 am, 3 - 9 to 11, 4 - 11 to 14, etc., but feel free to change it as suits - just my thoughts.
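For example, a rough sketch of that coding (assuming an integer starthour column, as in the sample above; the bin edges are just one reading of the description and are easy to adjust):

import pandas as pd

# Hypothetical period coding: (-1, 5] -> 1 (before 6 am), (5, 8] -> 2 (6-9),
# (8, 10] -> 3 (9-11), (10, 13] -> 4 (11-14), (13, 23] -> 5 (the rest)
df['time_period'] = pd.cut(df['starthour'],
                           bins=[-1, 5, 8, 10, 13, 23],
                           labels=[1, 2, 3, 4, 5]).astype(int)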

I've tried a few approaches to this using groupby, but it doesn't seem to work super well - would love some guidance here.

Amazing response, wow, cheers. It made me realise I was approaching this incorrectly, but it's still super valuable for fixing my approach.

This data is the individual occurrences, as an example.

DateTimeStamp   VS_ID   VS_Summary_Id   Hostname    Vehicle_speed   Lane    Length
11/01/2019 8:22 1   1   place_uno   65  2   71
11/01/2019 8:22 2   1   place_uno   59  1   375
11/01/2019 8:22 3   1   place_uno   59  1   389
11/01/2019 8:22 4   1   place_duo   59  1   832
11/01/2019 8:22 5   1   place_duo   52  1   409

To get volumes I need to aggregate over time into smaller volume blocks (15-second or 15-minute; code posted below).

Then it's essentially the same idea. An additional, and greedy, question would be: how would I incorporate speed into this measurement? I.e., large volumes but low speeds would be good to also cater for.
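A rough sketch of what I mean by the 15-second aggregation (using the column names from the sample above; day-first dates are an assumption about the format):

import pandas as pd

# Volume and mean speed per 15-second block, per hostname
df['DateTimeStamp'] = pd.to_datetime(df['DateTimeStamp'], dayfirst=True)  # assumed day-first
blocks = (df.groupby(['Hostname', pd.Grouper(key='DateTimeStamp', freq='15S')])
            .agg(volume=('VS_ID', 'count'),
                 average_speed=('Vehicle_speed', 'mean')))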

Awesome amazing amazing stuff

With those volume calculations per 15 seconds, I want to do the clustering ON those, as the summary table is too broad-based. I think with what has been linked it should be fine to do that regardless, if I rejig it a bit. I had realised the summary table was too broad, so the time clustering wasn't going to work unless I used this data. So the k-means over time is best with the new data, and the average speed of each cluster is to allow speed to be considered, if that makes sense.
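For example, a rough sketch of that idea (assuming a frame of 15-second intervals with volume and average_speed columns, like the one sketched above):

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_intervals(intervaled_df, n_clusters=4):
    # Scale both features so raw volume does not dominate speed
    features = StandardScaler().fit_transform(
        intervaled_df[['volume', 'average_speed']].fillna(0))
    intervaled_df['cluster'] = KMeans(n_clusters=n_clusters,
                                      random_state=9).fit_predict(features)
    return intervaled_df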

Thanks again for the amazing help. I will be fitting this into the code below - I forgot to link it earlier, and it could help make the answer more specific and valuable.

Thanks guys!

asked Sep 14 '20 by LeCoda



1 Answer

First Data (note: the later parts relate to the updates)

The data is very limited, probably due to the complexity of simplifying it, so I shall make some assumptions and write this as generically as possible, so you can quickly customize it to your needs.

Assumptions:

  1. You want to group the data by hour windows ("hour_code"); the column the data is grouped by is therefore parameterized as group_divide_set_by_column
  2. For each hour window ("hour_code"), you want to cluster by location using the K-means algorithm

Doing so allows you to investigate the clusters of vehicles for each hour window separately, and learn which clustered areas are more active and need attention.

Notes:

  1. The location column (although noted) is missing, yet it is required for the K-means algorithm (I used HostName_key, but it's just a dummy so the code would run; it is not necessarily meaningful).
    Generally speaking, the K-means algorithm is for spaces with a Euclidean distance (mathematically, this means partitioning the observations according to the Voronoi diagram generated by the means).
    Here are some sources for k-means Python examples that are useful to further customize it: 1 2 3 4.
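Since K-means assumes a Euclidean space, a categorical key should really be encoded before clustering; a minimal sketch of that (with real x/y coordinates you would cluster those directly):

import pandas as pd
from sklearn.cluster import KMeans

# One-hot encode the categorical key so K-means sees numeric features
encoded = pd.get_dummies(df['HostName_key'], prefix='host')
labels = KMeans(n_clusters=3, random_state=9).fit_predict(encoded)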

Code:

  1. Let's define a function which, given a dataframe, group-divides it by a given column, group_divide_set_by_column.

This would allow us to group-divide by 'hour_code', and then cluster by location.

def create_clusters_by_group(df, group_divide_set_by_column='hour_code', clusters_number_list=[2, 3]):
    # Divide the set by hours
    divide_df_by_hours(df)
    lst_df_by_groups = {f'{group_divide_set_by_column}_{i}': d for i, (g, d) in enumerate(df.groupby(group_divide_set_by_column))}
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        # Divide to desired amount of clusters
        for clusters_number in clusters_number_list:
            create_cluster(group_df, clusters_number)
        # Setting column types
        set_colum_types(group_df)
    return lst_df_by_groups

  2. The function above uses another function to convert hours to hour codes, similar to how you phrased it:

Time period meaning, say 1 would be before 6 am, 2 - 6 am to 9 am, 3 - 9 to 11, 4 - 11 to 14, etc..

def divide_df_by_hours(df):
    def get_hour_code(h, start_threshold=6, end_threshold=21, windows=3):
        """
        Divide hours to groups:
        Hours:
         1-5  => 1
         6-8  => 2
         9-11 => 3
        12-14 => 4
        15-17 => 5
        18-20 => 6
        21+   => 7
        """
        if      h < start_threshold:
            return 1
        elif    h >= end_threshold:
            return (end_threshold // windows)
        return h // windows
    df['hour_code'] = df['starthour'].apply(lambda h : get_hour_code(h))
  3. Moreover, the function above uses the set_colum_types function, which converts columns to their matching types:

def set_colum_types(df):
    types_dict = {
        'Startdtm': 'datetime64[ns, Australia/Melbourne]',
        'HostName_key': 'category',
        'Totalvehicles': 'int32',
        'Enddtm': 'datetime64[ns, Australia/Melbourne]',
        'starthour': 'int32',
        'timedelta': 'float',
        'vehiclespersec': 'float',
    }
    for col, col_type in types_dict.items():
        df[col] = df[col].astype(col_type)
  4. A dedicated timeit decorator is used to measure the time of each clustering run, reducing boilerplate code.

Whole Code:

import functools

import pandas as pd

from timeit import default_timer as timer

import sklearn
from sklearn.cluster import KMeans

def timeit(func):
    @functools.wraps(func)
    def newfunc(*args, **kwargs):
        startTime = timer()
        result = func(*args, **kwargs)
        elapsedTime = timer() - startTime
        print('function [{}] finished in {} ms'.format(
            func.__name__, int(elapsedTime * 1000)))
        return result  # preserve the wrapped function's return value
    return newfunc

def set_colum_types(df):
    types_dict = {
        'Startdtm': 'datetime64[ns, Australia/Melbourne]',
        'HostName_key': 'category',
        'Totalvehicles': 'int32',
        'Enddtm': 'datetime64[ns, Australia/Melbourne]',
        'starthour': 'int32',
        'timedelta': 'float',
        'vehiclespersec': 'float',
    }
    for col, col_type in types_dict.items():
        df[col] = df[col].astype(col_type)

@timeit
def create_cluster(df, clusters_number):
    # Create K-Means model
    model       = KMeans(n_clusters=clusters_number, max_iter=600, random_state=9)
    # Fetch location
    # NOTE: Should be a *real* location, used another column as dummy
    location_df = df[['HostName_key']]
    kmeans      = model.fit(location_df)
    # Divide to clusters
    df[f'kmeans_{clusters_number}'] = kmeans.labels_

def divide_df_by_hours(df):
    def get_hour_code(h, start_threshold=6, end_threshold=21, windows=3):
        """
        Divide hours to groups:
        Hours:
         1-5  => 1
         6-8  => 2
         9-11 => 3
        12-14 => 4
        15-17 => 5
        18-20 => 6
        21+   => 7
        """
        if      h < start_threshold:
            return 1
        elif    h >= end_threshold:
            return (end_threshold // windows)
        return h // windows
    df['hour_code'] = df['starthour'].apply(lambda h : get_hour_code(h))

def create_clusters_by_group(df, group_divide_set_by_column='hour_code', clusters_number_list=[2, 3]):
    # Divide the set by hours
    divide_df_by_hours(df)
    lst_df_by_groups = {f'{group_divide_set_by_column}_{i}': d for i, (g, d) in enumerate(df.groupby(group_divide_set_by_column))}
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        # Divide to desired amount of clusters
        for clusters_number in clusters_number_list:
            create_cluster(group_df, clusters_number)
        # Setting column types
        set_colum_types(group_df)
    return lst_df_by_groups

# Load data
df = pd.read_csv('data.csv')

# Print data
print(df)

# Create clusters
lst_df_by_groups = create_clusters_by_group(df)

# For each hostname-key dataframe
for group_df_name, group_df in lst_df_by_groups.items():
    print(f'Group {group_df_name} dataframe:')
    print(group_df)

Example output:

              Startdtm  HostName_key  ...  timedelta vehiclespersec
0  2020-01-15 08:22:39             0  ...    29400.0       0.326633
1  2020-01-13 08:22:07             2  ...    28981.0       0.354474
2  2020-01-23 07:16:55             3  ...    26149.0       0.197904
3  2020-01-15 07:00:06             4  ...     2783.0       0.301114
4  2020-01-15 08:16:01             1  ...    15915.0       0.366949
5  2020-01-16 08:22:39             2  ...    29400.0       0.326633
6  2020-01-14 08:22:07             2  ...    28981.0       0.354479
7  2020-01-25 07:16:55             4  ...    26149.0       0.197904
8  2020-01-17 07:00:06             1  ...     2783.0       0.301114
9  2020-01-18 08:16:01             1  ...    15915.0       0.366949

[10 rows x 7 columns]
function [create_cluster] finished in 10 ms
function [create_cluster] finished in 11 ms
function [create_cluster] finished in 10 ms
function [create_cluster] finished in 11 ms
function [create_cluster] finished in 10 ms
function [create_cluster] finished in 11 ms
Group hour_code_0 dataframe:
                   Startdtm HostName_key  ...  kmeans_2 kmeans_3
0 2020-01-15 08:22:39+11:00            0  ...         1        1
1 2020-01-13 08:22:07+11:00            2  ...         0        0
2 2020-01-23 07:16:55+11:00            3  ...         0        2

[3 rows x 10 columns]
Group hour_code_1 dataframe:
                   Startdtm HostName_key  ...  kmeans_2 kmeans_3
3 2020-01-15 07:00:06+11:00            4  ...         1        1
4 2020-01-15 08:16:01+11:00            1  ...         0        0
5 2020-01-16 08:22:39+11:00            2  ...         0        2

[3 rows x 10 columns]
Group hour_code_2 dataframe:
                   Startdtm HostName_key  ...  kmeans_2 kmeans_3
6 2020-01-14 08:22:07+11:00            2  ...         1        2
7 2020-01-25 07:16:55+11:00            4  ...         0        0
8 2020-01-17 07:00:06+11:00            1  ...         1        1
9 2020-01-18 08:16:01+11:00            1  ...         1        1

[4 rows x 10 columns]

Update: Second Data

So, this time we will do things a little differently, as the updated objective is to understand how many vehicles are at each place, and their speed.

Again, things are written generically and with great care, for ease of adaptation.

  1. First, we divide the data set into groups based on their location, which is inferred from Hostname (parameterized for customization as dividing_colum).
def divide_df_by_column(df, dividing_colum='Hostname'):
    df_by_groups = {f'{dividing_colum}_{g}': d for i, (g, d) in enumerate(df.groupby(dividing_colum))}
    return df_by_groups
  2. Now, we arrange the data for each hostname (dividing_colum) group:
def arrange_groups_df(lst_df_by_groups):
    df_by_intervaled_group = dict()
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        df_by_intervaled_group[group_df_name] = arrange_data(group_df)
    return df_by_intervaled_group

2.1. We group by intervals of 15 minutes; after each hostname area's data is divided into time intervals, we aggregate the number of vehicles into a volume column and the average speed into an average_speed column.

def group_by_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    intervaled_df = (df.groupby([pd.Grouper(key=DATE_COLUMN_NAME, freq=INTERVAL_WINDOW)])
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df

def arrange_data(df):
    df = group_by_interval(df)
    return df

The end result of stage #2 is that each hostname's data is divided into time windows of 15 minutes, and we know how many vehicles passed in each window and what their average speed was.

By this, we achieve the objective:

An additional, and greedy question, would be - how would i interpolate speed into this measurement? i.e., large amounts of volumes, but low speeds, would be good to also cater for.

Again, all customizable using [TIME_INTERVAL_COLUMN_NAME, DATE_COLUMN_NAME, INTERVAL_WINDOW].
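To act on that "large volume, low speed" point, here is a minimal sketch that flags such intervals using the columns computed above (the median thresholds are just one possible choice):

def flag_congestion(intervaled_df):
    # Rough indicator: above-median volume combined with below-median speed
    high_volume = intervaled_df['volume'] >= intervaled_df['volume'].median()
    low_speed = intervaled_df['average_speed'] <= intervaled_df['average_speed'].median()
    intervaled_df['congested'] = high_volume & low_speed
    return intervaled_df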

The whole code:

import functools

import numpy
import pandas as pd

TIME_INTERVAL_COLUMN_NAME   = 'time_interval'

DATE_COLUMN_NAME            = 'DateTimeStamp'

INTERVAL_WINDOW             = '15Min'

def round_time(df):
    # Setting the date column to be of datetime type
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Rounding to the interval window
    df[TIME_INTERVAL_COLUMN_NAME] = df[DATE_COLUMN_NAME].dt.round(INTERVAL_WINDOW)

def group_by_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    intervaled_df = (df.groupby([pd.Grouper(key=DATE_COLUMN_NAME, freq=INTERVAL_WINDOW)])
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df

def arrange_data(df):
    df = group_by_interval(df)
    return df

def divide_df_by_column(df, dividing_colum='Hostname'):
    df_by_groups = {f'{dividing_colum}_{g}': d for i, (g, d) in enumerate(df.groupby(dividing_colum))}
    return df_by_groups

def arrange_groups_df(lst_df_by_groups):
    df_by_intervaled_group = dict()
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        df_by_intervaled_group[group_df_name] = arrange_data(group_df)
    return df_by_intervaled_group

# Load data
df = pd.read_csv('data2.csv')

# Print data
print(df)

# Divide by column
df_by_groups = divide_df_by_column(df)

# Arrange data for each group
df_by_intervaled_group = arrange_groups_df(df_by_groups)

# For each hostname-key dataframe
for group_df_name, intervaled_group_df in df_by_intervaled_group.items():
    print(f'Group {group_df_name} dataframe:')
    print(intervaled_group_df)

Example Output:

We can now get valuable results from measuring the volumes (amount of vehicles) and average speed, for each individual hostname area.

     DateTimeStamp  VS_ID  VS_Summary_Id   Hostname  Vehicle_speed  Lane  Length
0  11/01/2019 8:22      1              1  place_uno             65     2      71
1  11/01/2019 8:23      2              1  place_uno             59     1     375
2  11/01/2019 8:25      3              1  place_uno             59     1     389
3  11/01/2019 8:26      4              1  place_duo             59     1     832
4  11/01/2019 8:40      5              1  place_duo             52     1     409
Group Hostname_place_duo dataframe:
                     average_speed  volume
DateTimeStamp                             
2019-11-01 08:15:00             59       1
2019-11-01 08:30:00             52       1
Group Hostname_place_uno dataframe:
                     average_speed  volume
DateTimeStamp                             
2019-11-01 08:15:00             61       3

Appendix

Also created a round_time function, which allows rounding to time intervals without grouping:

def round_time(df):
    # Setting the date column to be of datetime type
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Rounding to the interval window
    df[TIME_INTERVAL_COLUMN_NAME] = df[DATE_COLUMN_NAME].dt.round(INTERVAL_WINDOW)
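A small usage sketch: after round_time, a plain groupby over the rounded column yields the same 15-minute aggregation without pd.Grouper:

round_time(df)
intervaled = df.groupby(TIME_INTERVAL_COLUMN_NAME).agg(
    average_speed=('Vehicle_speed', 'mean'),
    volume=('Hostname', 'count'))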

Third Update

So this time we want to reduce the number of rows in the result.

  1. We change the way we group the data: not only by interval but also by the day of the week. The result allows us to investigate how traffic behaves on each day of the week, in its 15-minute intervals. The group_by_interval function is now changed to group on the concise interval and is therefore called group_by_concised_interval.

We shall call the combination of [day-in-week, hour-minute] a "concise interval"; again, this is configurable with CONCISE_INTERVAL_FORMAT.

def group_by_concised_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Rounding time
    round_time(df)
    # Adding concised interval
    add_consice_interval_columns(df)
    intervaled_df = (df.groupby([TIME_INTERVAL_CONCISE_COLUMN_NAME])
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df

1.1. The group_by_concised_interval function first rounds time to the given 15-minute interval (configurable via INTERVAL_WINDOW) using the round_time method.

1.2. After creating the time intervals for each date, we apply the add_consice_interval_columns function which, given the timestamp rounded to the interval, extracts the concise form.

def add_consice_interval_columns(df):
    # Adding a column for the time interval at day-in-week and hour-minute resolution
    df[TIME_INTERVAL_CONCISE_COLUMN_NAME] = df[TIME_INTERVAL_COLUMN_NAME].apply(lambda x: x.strftime(CONCISE_INTERVAL_FORMAT))

The whole code is:

import functools

import numpy
import pandas as pd

TIME_INTERVAL_COLUMN_NAME           = 'time_interval'

TIME_INTERVAL_CONCISE_COLUMN_NAME   = 'time_interval_concise'

DATE_COLUMN_NAME                    = 'DateTimeStamp'

INTERVAL_WINDOW                     = '15Min'

CONCISE_INTERVAL_FORMAT             = '%A %H:%M'

def round_time(df):
    # Setting the date column to be of datetime type
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Rounding to the interval window
    df[TIME_INTERVAL_COLUMN_NAME] = df[DATE_COLUMN_NAME].dt.round(INTERVAL_WINDOW)

def add_consice_interval_columns(df):
    # Adding a column for the time interval at day-in-week and hour-minute resolution
    df[TIME_INTERVAL_CONCISE_COLUMN_NAME] = df[TIME_INTERVAL_COLUMN_NAME].apply(lambda x: x.strftime(CONCISE_INTERVAL_FORMAT))

def group_by_concised_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Rounding time
    round_time(df)
    # Adding concised interval
    add_consice_interval_columns(df)
    intervaled_df = (df.groupby([TIME_INTERVAL_CONCISE_COLUMN_NAME])
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df

def arrange_data(df):
    df = group_by_concised_interval(df)
    return df

def divide_df_by_column(df, dividing_colum='Hostname'):
    df_by_groups = {f'{dividing_colum}_{g}': d for i, (g, d) in enumerate(df.groupby(dividing_colum))}
    return df_by_groups

def arrange_groups_df(lst_df_by_groups):
    df_by_intervaled_group = dict()
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        df_by_intervaled_group[group_df_name] = arrange_data(group_df)
    return df_by_intervaled_group

# Load data
df = pd.read_csv('data2.csv')

# Print data
print(df)

# Divide by column
df_by_groups = divide_df_by_column(df)

# Arrange data for each group
df_by_intervaled_group = arrange_groups_df(df_by_groups)

# For each hostname-key dataframe
for group_df_name, intervaled_group_df in df_by_intervaled_group.items():
    print(f'Group {group_df_name} dataframe:')
    print(intervaled_group_df)

Output:

Group Hostname_place_duo dataframe:
                       average_speed  volume
time_interval_concise                       
Friday 08:30                      59       1
Friday 08:45                      52       1
Group Hostname_place_uno dataframe:
                       average_speed  volume
time_interval_concise                       
Friday 08:15                      65       1
Friday 08:30                      59       2

So now we can easily figure out how traffic behaves on each day of the week, at all available time intervals.
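As a small extra (assuming the df_by_intervaled_group dict built above), the per-hostname frames can also be stacked for a side-by-side view:

# Dict keys become the outer index level
combined = pd.concat(df_by_intervaled_group, names=['hostname_group'])
print(combined.sort_values('volume', ascending=False).head())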

answered Oct 26 '22 by Aviv Yaniv