I have a dataset that looks something like this:
date                 area_key  total_units  timeatend            starthour  timedifference  vps
2020-01-15 08:22:39  0         9603         2020-01-15 16:32:39  8          29400.0         0.32663265306122446
2020-01-13 08:22:07  0         10273        2020-01-13 16:25:08  8          28981.0         0.35447362064801075
2020-01-23 07:16:55  3         5175         2020-01-23 14:32:44  7          26149.0         0.19790431756472524
2020-01-15 07:00:06  1         838          2020-01-15 07:46:29  7          2783.0          0.3011139058569889
2020-01-15 08:16:01  1         5840         2020-01-15 12:41:16  8          15915.0         0.3669494187873076
That is then fed into the following function to create k-means clusters.
import time
import sklearn.cluster as clstr

def cluster_Volume(inputData):
    ## 4 Clusters
    start_tot = time.time()
    # Sum vehicles-per-second per start time; to_frame() keeps a DataFrame
    # so the cluster labels can be attached as columns below.
    Volume = inputData.groupby('Startdtm')['vehiclespersec'].sum().to_frame()
    model = clstr.MiniBatchKMeans(n_clusters=4)
    model.fit(Volume.fillna(0))
    Volume['kmeans_4'] = model.predict(Volume.fillna(0))
    end_tot = time.time()
    print("Completed in " + str(end_tot - start_tot))

    ## 8 Clusters
    start_tot = time.time()
    model = clstr.KMeans(n_clusters=8)
    model.fit(Volume.fillna(0))
    Volume['kmeans_8'] = model.predict(Volume.fillna(0))
    end_tot = time.time()
    print("Completed in " + str(end_tot - start_tot))

    ## Looking at hourly distribution.
    start_tot = time.time()
    Volume_Hourly = Volume.reset_index().set_index('Startdtm')
    Volume_Hourly['hour'] = Volume_Hourly.index.hour
    end_tot = time.time()
    print("Completed in " + str(end_tot - start_tot))
    return Volume, Volume_Hourly
What I want to do is make those clusters relate to both time periods and keys.
For the time periods: 7 am to 10 am, 4 pm to 6 pm, and 12 pm to 2 pm as the main windows, with 6 pm to 12 am, 12 am to 7 am, 10 am to 12 pm, and 2 pm to 4 pm as the other time periods.
And for the keys: showing how each cluster behaves differently, in a programmatic way.
The desired result will have a table similar to below, but feel free to develop it in the best way you can think of. By time period I mean, say, 1 would be before 6 am, 2 - 6 am to 9 am, 3 - 9 to 11, 4 - 11 to 14, etc., but feel free to change it as suits; these are just my thoughts.
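To illustrate the period coding I have in mind, something like this sketch against the summary data above (the bins are just my first guess and can be changed to suit):

import pandas as pd

# df is the summary dataframe shown at the top; bins are hour boundaries.
bins = [0, 7, 10, 12, 14, 16, 18, 24]
labels = ['12am-7am', '7am-10am', '10am-12pm', '12pm-2pm',
          '2pm-4pm', '4pm-6pm', '6pm-12am']
df['period'] = pd.cut(df['starthour'], bins=bins, labels=labels, right=False)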
I've tried a few approaches to this using groupby, but it doesn't seem to work super well - would love some guidance here.
This data is the individual occurrences, as an example.
DateTimeStamp    VS_ID  VS_Summary_Id  Hostname   Vehicle_speed  Lane  Length
11/01/2019 8:22  1      1              place_uno  65             2     71
11/01/2019 8:22  2      1              place_uno  59             1     375
11/01/2019 8:22  3      1              place_uno  59             1     389
11/01/2019 8:22  4      1              place_duo  59             1     832
11/01/2019 8:22  5      1              place_duo  52             1     409
To get volumes I need to aggregate over time in smaller volume blocks (15 seconds or 15 minutes; I will post the code below).
Then it's essentially the same idea. An additional, and greedy, question would be: how would I incorporate speed into this measurement? I.e., large volumes at low speeds would be good to also cater for.
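In the meantime, this is roughly the aggregation I have been attempting (a sketch; I am assuming day-first dates since the data is Australian, and 'data2.csv' stands in for my raw occurrence file):

import pandas as pd

# Count vehicles and average their speed per hostname per 15-minute block.
raw = pd.read_csv('data2.csv', parse_dates=['DateTimeStamp'], dayfirst=True)
volumes = (raw.set_index('DateTimeStamp')
              .groupby('Hostname')
              .resample('15min')
              .agg({'VS_ID': 'count', 'Vehicle_speed': 'mean'})
              .rename(columns={'VS_ID': 'volume',
                               'Vehicle_speed': 'average_speed'}))
print(volumes)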
Thanks again for the amazing help. I will be fitting this into the code below; I forgot to link it earlier, and it should help make the question more specific and valuable.
Thanks guys!
First Data (note: the later parts relate to the updates)
The data is very limited, probably because of the complexity of simplifying it, so I shall make some assumptions and write this as generically as possible, so you can quickly customize it to your needs.
Assumptions:
- The set is divided into groups by a configurable column, group_divide_set_by_column. Doing so allows you to investigate the clusters of vehicles for each hour window separately, and learn which clustered areas are more active and need attention.
Notes:
- Clustering is done on HostName_key, but it's just a dummy so the code would run; it is not necessarily meaningful.
Code:
The default for group_divide_set_by_column is 'hour_code'. This would allow us to group-divide by 'hour_code', and then cluster by location.
def create_clusters_by_group(df, group_divide_set_by_column='hour_code', clusters_number_list=(2, 3)):
    # Divide set by hours
    divide_df_by_hours(df)
    df_by_groups = {f'{group_divide_set_by_column}_{i}': d
                    for i, (_, d) in enumerate(df.groupby(group_divide_set_by_column))}
    # For each group dataframe
    for group_df_name, group_df in df_by_groups.items():
        # Divide to desired amount of clusters
        for clusters_number in clusters_number_list:
            create_cluster(group_df, clusters_number)
        # Setting column types
        set_column_types(group_df)
    return df_by_groups
The divide_df_by_hours function converts each hour to an hour code, similar to how you phrased it:
"Time period meaning, say 1 would be before 6 am, 2 - 6 am to 9 am, 3 - 9 to 11, 4 - 11 to 14, etc.."
def divide_df_by_hours(df):
    def get_hour_code(h, start_threshold=6, end_threshold=21, windows=3):
        """
        Divide hours to groups:
        Hours:
        1-5   => 1
        6-8   => 2
        9-11  => 3
        12-14 => 4
        15-17 => 5
        18-20 => 6
        21+   => 7
        """
        if h < start_threshold:
            return 1
        elif h >= end_threshold:
            return end_threshold // windows
        return h // windows
    df['hour_code'] = df['starthour'].apply(get_hour_code)
The set_column_types function converts columns to their matching types:

def set_column_types(df):
    types_dict = {
        'Startdtm': 'datetime64[ns, Australia/Melbourne]',
        'HostName_key': 'category',
        'Totalvehicles': 'int32',
        'Enddtm': 'datetime64[ns, Australia/Melbourne]',
        'starthour': 'int32',
        'timedelta': 'float',
        'vehiclespersec': 'float',
    }
    for col, col_type in types_dict.items():
        df[col] = df[col].astype(col_type)
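One caveat: depending on your pandas version, casting raw strings straight to a timezone-aware dtype with astype may raise. If it does on your setup, a safer two-step conversion is (a sketch, assuming the raw strings are naive local times in Australia/Melbourne):

import pandas as pd

# Parse to naive datetimes first, then attach the timezone.
for col in ['Startdtm', 'Enddtm']:
    df[col] = pd.to_datetime(df[col]).dt.tz_localize('Australia/Melbourne')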
A timeit decorator is used to measure the time of each clustering run, so boilerplate code is reduced.
Whole Code:
import functools
import pandas as pd
from timeit import default_timer as timer
from sklearn.cluster import KMeans

def timeit(func):
    @functools.wraps(func)
    def newfunc(*args, **kwargs):
        start_time = timer()
        result = func(*args, **kwargs)
        elapsed_time = timer() - start_time
        print('function [{}] finished in {} ms'.format(
            func.__name__, int(elapsed_time * 1000)))
        return result
    return newfunc

def set_column_types(df):
    types_dict = {
        'Startdtm': 'datetime64[ns, Australia/Melbourne]',
        'HostName_key': 'category',
        'Totalvehicles': 'int32',
        'Enddtm': 'datetime64[ns, Australia/Melbourne]',
        'starthour': 'int32',
        'timedelta': 'float',
        'vehiclespersec': 'float',
    }
    for col, col_type in types_dict.items():
        df[col] = df[col].astype(col_type)

@timeit
def create_cluster(df, clusters_number):
    # Create K-Means model
    model = KMeans(n_clusters=clusters_number, max_iter=600, random_state=9)
    # Fetch location
    # NOTE: Should be a *real* location, used another column as dummy
    location_df = df[['HostName_key']]
    kmeans = model.fit(location_df)
    # Divide to clusters
    df[f'kmeans_{clusters_number}'] = kmeans.labels_

def divide_df_by_hours(df):
    def get_hour_code(h, start_threshold=6, end_threshold=21, windows=3):
        """
        Divide hours to groups:
        Hours:
        1-5   => 1
        6-8   => 2
        9-11  => 3
        12-14 => 4
        15-17 => 5
        18-20 => 6
        21+   => 7
        """
        if h < start_threshold:
            return 1
        elif h >= end_threshold:
            return end_threshold // windows
        return h // windows
    df['hour_code'] = df['starthour'].apply(get_hour_code)

def create_clusters_by_group(df, group_divide_set_by_column='hour_code', clusters_number_list=(2, 3)):
    # Divide set by hours
    divide_df_by_hours(df)
    df_by_groups = {f'{group_divide_set_by_column}_{i}': d
                    for i, (_, d) in enumerate(df.groupby(group_divide_set_by_column))}
    # For each group dataframe
    for group_df_name, group_df in df_by_groups.items():
        # Divide to desired amount of clusters
        for clusters_number in clusters_number_list:
            create_cluster(group_df, clusters_number)
        # Setting column types
        set_column_types(group_df)
    return df_by_groups

# Load data
df = pd.read_csv('data.csv')
# Print data
print(df)
# Create clusters
df_by_groups = create_clusters_by_group(df)
# For each hour-code group dataframe
for group_df_name, group_df in df_by_groups.items():
    print(f'Group {group_df_name} dataframe:')
    print(group_df)
Example output:
Startdtm HostName_key ... timedelta vehiclespersec
0 2020-01-15 08:22:39 0 ... 29400.0 0.326633
1 2020-01-13 08:22:07 2 ... 28981.0 0.354474
2 2020-01-23 07:16:55 3 ... 26149.0 0.197904
3 2020-01-15 07:00:06 4 ... 2783.0 0.301114
4 2020-01-15 08:16:01 1 ... 15915.0 0.366949
5 2020-01-16 08:22:39 2 ... 29400.0 0.326633
6 2020-01-14 08:22:07 2 ... 28981.0 0.354479
7 2020-01-25 07:16:55 4 ... 26149.0 0.197904
8 2020-01-17 07:00:06 1 ... 2783.0 0.301114
9 2020-01-18 08:16:01 1 ... 15915.0 0.366949
[10 rows x 7 columns]
function [create_cluster] finished in 10 ms
function [create_cluster] finished in 11 ms
function [create_cluster] finished in 10 ms
function [create_cluster] finished in 11 ms
function [create_cluster] finished in 10 ms
function [create_cluster] finished in 11 ms
Group hour_code_0 dataframe:
Startdtm HostName_key ... kmeans_2 kmeans_3
0 2020-01-15 08:22:39+11:00 0 ... 1 1
1 2020-01-13 08:22:07+11:00 2 ... 0 0
2 2020-01-23 07:16:55+11:00 3 ... 0 2
[3 rows x 10 columns]
Group hour_code_1 dataframe:
Startdtm HostName_key ... kmeans_2 kmeans_3
3 2020-01-15 07:00:06+11:00 4 ... 1 1
4 2020-01-15 08:16:01+11:00 1 ... 0 0
5 2020-01-16 08:22:39+11:00 2 ... 0 2
[3 rows x 10 columns]
Group hour_code_2 dataframe:
Startdtm HostName_key ... kmeans_2 kmeans_3
6 2020-01-14 08:22:07+11:00 2 ... 1 2
7 2020-01-25 07:16:55+11:00 4 ... 0 0
8 2020-01-17 07:00:06+11:00 1 ... 1 1
9 2020-01-18 08:16:01+11:00 1 ... 1 1
[4 rows x 10 columns]
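Since you also asked to see how each cluster behaves, in a programmatic way, one option (a sketch built on the df_by_groups result above) is to summarize each cluster label per hour-window group:

# For each hour-window group, describe the 3-cluster split:
# mean vehicles-per-second and row count per cluster label.
for group_df_name, group_df in df_by_groups.items():
    summary = (group_df.groupby('kmeans_3')
                       .agg(mean_vps=('vehiclespersec', 'mean'),
                            rows=('vehiclespersec', 'size')))
    print(f'Cluster summary for {group_df_name}:')
    print(summary)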
Update: Second Data
So, this time we will make things a little different, as the updated objective is to understand how many vehicles are at each place, and their speed.
Again, things are written as generically as possible, for ease of adaptation.
1. First, we divide the dataset into groups by a configurable column (dividing_column):

def divide_df_by_column(df, dividing_column='Hostname'):
    df_by_groups = {f'{dividing_column}_{g}': d
                    for g, d in df.groupby(dividing_column)}
    return df_by_groups
2. Then, we arrange the data for each (dividing_column) group:

def arrange_groups_df(df_by_groups):
    df_by_intervaled_group = dict()
    # For each group dataframe
    for group_df_name, group_df in df_by_groups.items():
        df_by_intervaled_group[group_df_name] = arrange_data(group_df)
    return df_by_intervaled_group
2.1. We group by intervals of 15 minutes; once each hostname area's data is divided into time intervals, we aggregate the number of vehicles into a volume column and the average speed into an average_speed column.
def group_by_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    intervaled_df = (df.groupby(pd.Grouper(key=DATE_COLUMN_NAME, freq=INTERVAL_WINDOW))
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df

def arrange_data(df):
    df = group_by_interval(df)
    return df
The end result of stage #2 is that each hostname's data is divided into time windows of 15 minutes, and we know how many vehicles passed in each window and their average speed.
By this, we achieve the objective:
"An additional, and greedy, question would be: how would I incorporate speed into this measurement? I.e., large volumes at low speeds would be good to also cater for."
Again, all customizable using [TIME_INTERVAL_COLUMN_NAME, DATE_COLUMN_NAME, INTERVAL_WINDOW].
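As for the greedy part of the question (large volumes at low speeds): once each interval has volume and average_speed, one simple option, purely a suggestion of mine rather than a standard traffic metric, is to combine them into a single congestion score on the df_by_intervaled_group result produced by the code below:

# Hypothetical congestion score: more vehicles at lower speed => higher.
for group_df_name, intervaled_group_df in df_by_intervaled_group.items():
    intervaled_group_df['congestion'] = (
        intervaled_group_df['volume'] / intervaled_group_df['average_speed']
    )
    print(f'Most congested intervals for {group_df_name}:')
    print(intervaled_group_df.nlargest(3, 'congestion'))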
The whole code:
import pandas as pd

TIME_INTERVAL_COLUMN_NAME = 'time_interval'
DATE_COLUMN_NAME = 'DateTimeStamp'
INTERVAL_WINDOW = '15Min'

def round_time(df):
    # Setting the date column to be of datetime type
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Rounding each timestamp to the nearest interval
    df[TIME_INTERVAL_COLUMN_NAME] = df[DATE_COLUMN_NAME].dt.round(INTERVAL_WINDOW)

def group_by_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    intervaled_df = (df.groupby(pd.Grouper(key=DATE_COLUMN_NAME, freq=INTERVAL_WINDOW))
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df

def arrange_data(df):
    df = group_by_interval(df)
    return df

def divide_df_by_column(df, dividing_column='Hostname'):
    df_by_groups = {f'{dividing_column}_{g}': d
                    for g, d in df.groupby(dividing_column)}
    return df_by_groups

def arrange_groups_df(df_by_groups):
    df_by_intervaled_group = dict()
    # For each group dataframe
    for group_df_name, group_df in df_by_groups.items():
        df_by_intervaled_group[group_df_name] = arrange_data(group_df)
    return df_by_intervaled_group

# Load data
df = pd.read_csv('data2.csv')
# Print data
print(df)
# Divide by column
df_by_groups = divide_df_by_column(df)
# Arrange data for each group
df_by_intervaled_group = arrange_groups_df(df_by_groups)
# For each hostname-key dataframe
for group_df_name, intervaled_group_df in df_by_intervaled_group.items():
    print(f'Group {group_df_name} dataframe:')
    print(intervaled_group_df)
Example Output:
We can now get valuable results from measuring the volumes (number of vehicles) and average speed for each individual hostname area.
DateTimeStamp VS_ID VS_Summary_Id Hostname Vehicle_speed Lane Length
0 11/01/2019 8:22 1 1 place_uno 65 2 71
1 11/01/2019 8:23 2 1 place_uno 59 1 375
2 11/01/2019 8:25 3 1 place_uno 59 1 389
3 11/01/2019 8:26 4 1 place_duo 59 1 832
4 11/01/2019 8:40 5 1 place_duo 52 1 409
Group Hostname_place_duo dataframe:
average_speed volume
DateTimeStamp
2019-11-01 08:15:00 59 1
2019-11-01 08:30:00 52 1
Group Hostname_place_uno dataframe:
average_speed volume
DateTimeStamp
2019-11-01 08:15:00 61 3
Appendix
I also created a round_time function, which allows rounding timestamps to time intervals without grouping:
def round_time(df):
    # Setting the date column to be of datetime type
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Rounding each timestamp to the nearest interval
    df[TIME_INTERVAL_COLUMN_NAME] = df[DATE_COLUMN_NAME].dt.round(INTERVAL_WINDOW)
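A quick usage check (assuming the constants and dataframe from the code above):

# Works in place: adds a time_interval column next to the raw timestamps,
# without collapsing any rows.
round_time(df)
print(df[[DATE_COLUMN_NAME, TIME_INTERVAL_COLUMN_NAME]].head())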
Third Update
So this time we want to reduce the number of rows in the result.
1. The group_by_interval function is now changed to group on the concise interval, and is thus renamed group_by_concised_interval. We shall call the combination of [day-in-week, hour-minute] the "concise interval"; again, this is configurable via CONCISE_INTERVAL_FORMAT.
def group_by_concised_interval(df):
    # Rounding time (round_time also parses the date column to datetime)
    round_time(df)
    # Adding the concise interval column
    add_concise_interval_column(df)
    intervaled_df = (df.groupby(TIME_INTERVAL_CONCISE_COLUMN_NAME)
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df
1.1. group_by_concised_interval first rounds time to the given 15-minute interval (configurable via INTERVAL_WINDOW) using the round_time method.
1.2. After creating the time intervals for each date, we apply the add_concise_interval_column function that, given the interval-rounded timestamp, extracts the concise form.
def add_concise_interval_column(df):
    # Adding a column for the time interval in day-in-week, hour-minute resolution
    df[TIME_INTERVAL_CONCISE_COLUMN_NAME] = \
        df[TIME_INTERVAL_COLUMN_NAME].dt.strftime(CONCISE_INTERVAL_FORMAT)
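Note that grouping on a '%A %H:%M' string sorts alphabetically (Friday before Monday, and so on). If chronological order matters to you, one option, my own addition rather than part of the pipeline above, is to re-sort the result by an explicit weekday order:

# Weekday names sort alphabetically under groupby; this helper restores
# chronological order for the concise-interval index.
WEEKDAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
            'Friday', 'Saturday', 'Sunday']

def sort_concise_index(intervaled_df):
    order = sorted(intervaled_df.index,
                   key=lambda s: (WEEKDAYS.index(s.split(' ', 1)[0]),
                                  s.split(' ', 1)[1]))
    return intervaled_df.loc[order]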
The whole code is:
import pandas as pd

TIME_INTERVAL_COLUMN_NAME = 'time_interval'
TIME_INTERVAL_CONCISE_COLUMN_NAME = 'time_interval_concise'
DATE_COLUMN_NAME = 'DateTimeStamp'
INTERVAL_WINDOW = '15Min'
CONCISE_INTERVAL_FORMAT = '%A %H:%M'

def round_time(df):
    # Setting the date column to be of datetime type
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Rounding each timestamp to the nearest interval
    df[TIME_INTERVAL_COLUMN_NAME] = df[DATE_COLUMN_NAME].dt.round(INTERVAL_WINDOW)

def add_concise_interval_column(df):
    # Adding a column for the time interval in day-in-week, hour-minute resolution
    df[TIME_INTERVAL_CONCISE_COLUMN_NAME] = \
        df[TIME_INTERVAL_COLUMN_NAME].dt.strftime(CONCISE_INTERVAL_FORMAT)

def group_by_concised_interval(df):
    # Rounding time (round_time also parses the date column to datetime)
    round_time(df)
    # Adding the concise interval column
    add_concise_interval_column(df)
    intervaled_df = (df.groupby(TIME_INTERVAL_CONCISE_COLUMN_NAME)
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df

def arrange_data(df):
    df = group_by_concised_interval(df)
    return df

def divide_df_by_column(df, dividing_column='Hostname'):
    df_by_groups = {f'{dividing_column}_{g}': d
                    for g, d in df.groupby(dividing_column)}
    return df_by_groups

def arrange_groups_df(df_by_groups):
    df_by_intervaled_group = dict()
    # For each group dataframe
    for group_df_name, group_df in df_by_groups.items():
        df_by_intervaled_group[group_df_name] = arrange_data(group_df)
    return df_by_intervaled_group

# Load data
df = pd.read_csv('data2.csv')
# Print data
print(df)
# Divide by column
df_by_groups = divide_df_by_column(df)
# Arrange data for each group
df_by_intervaled_group = arrange_groups_df(df_by_groups)
# For each hostname-key dataframe
for group_df_name, intervaled_group_df in df_by_intervaled_group.items():
    print(f'Group {group_df_name} dataframe:')
    print(intervaled_group_df)
Output:
Group Hostname_place_duo dataframe:
average_speed volume
time_interval_concise
Friday 08:30 59 1
Friday 08:45 52 1
Group Hostname_place_uno dataframe:
average_speed volume
time_interval_concise
Friday 08:15 65 1
Friday 08:30 59 2
So now we can easily figure out how traffic behaves on each day of the week across all available time intervals.