I have a list of lists that I want to group into separate lists based on clusters of time.
I can easily sort it based on the time, but I have not found an easy way to group it together. I am fine with the time being a datetime/time object or text; either one works for me. I need to process the other data based on the cluster. This is a sample dataset that I might be working with.
[['asdf', '2012-01-01 00:00:12', '1234'],
['asdf', '2012-01-01 00:00:31', '1235'],
['asdf', '2012-01-01 00:00:57', '2345'],
['asdf', '2012-01-01 00:01:19', '2346'],
['asdf', '2012-01-01 00:01:25', '2345'],
['asdf', '2012-01-01 09:04:14', '3465'],
['asdf', '2012-01-01 09:04:34', '1613'],
['asdf', '2012-01-01 09:04:51', '8636'],
['asdf', '2012-01-01 09:05:15', '5847'],
['asdf', '2012-01-01 09:05:29', '3672'],
['asdf', '2012-01-01 09:05:30', '2367'],
['asdf', '2012-01-01 09:05:43', '9544'],
['asdf', '2012-01-01 14:48:15', '2572'],
['asdf', '2012-01-01 14:48:34', '7483'],
['asdf', '2012-01-01 14:48:56', '5782']]
The results should look something like this. A nested list of lists for each group.
[[['asdf', '2012-01-01 00:00:12', '1234'],
['asdf', '2012-01-01 00:00:31', '1235'],
['asdf', '2012-01-01 00:00:57', '2345'],
['asdf', '2012-01-01 00:01:19', '2346'],
['asdf', '2012-01-01 00:01:25', '2345']],
[['asdf', '2012-01-01 09:04:14', '3465'],
['asdf', '2012-01-01 09:04:34', '1613'],
['asdf', '2012-01-01 09:04:51', '8636'],
['asdf', '2012-01-01 09:05:15', '5847'],
['asdf', '2012-01-01 09:05:29', '3672'],
['asdf', '2012-01-01 09:05:30', '2367'],
['asdf', '2012-01-01 09:05:43', '9544']],
[['asdf', '2012-01-01 14:48:15', '2572'],
['asdf', '2012-01-01 14:48:34', '7483'],
['asdf', '2012-01-01 14:48:56', '5782']]]
The clusters have no set size and no set times. They can occur randomly throughout the day, and the grouping needs to be based on a large gap in the time.
The first group happens right after midnight and has 5 entries, the next one is centered around 09:05 and has 7 entries. The final one happens around 14:48 and only has 3 entries. I could also have two groups at either end of the hour, so I cannot just group by the hour.
I have already sorted and grouped the data by the first field in the list, I just need to break them down into smaller chunks to process. I am willing to change the date to whatever format is necessary to get the grouping done, as it is a key part of the analysis I am doing on the data.
I would prefer to keep the solution within the basic python libraries, but if there is no solution I can attempt to get other packages.
I have already looked at solutions here, here, here, here, and many others but none of them address the random nature of these times.
Splitting the list at any gap greater than X time would be a great solution, so I can change X to 5 or 10 minutes, whatever is deemed appropriate. Dropping any group that has length less than 3 would also be a bonus, but can easily be done at the end.
My only real idea right now is to loop through the list and compare each time with the next one, splitting the list that way, but it seems like a very inefficient way of solving this problem when there are millions of records to sort and group.
Any help would be greatly appreciated. If any of this doesn't make sense I will do my best to clarify.
The most common approach to time series clustering is to flatten the time series into a table, with a column for each time index (or an aggregation of the series), and directly apply standard clustering algorithms like k-means.
Clustering is the grouping of a set of objects or entities based on their characteristics, aggregating them according to their similarities. Clustering is similar to classification in that data are grouped; however, unlike classification, the groups are not predefined.
In k-means, the number of clusters must be specified first, and then that same number of 'centroids' is randomly allocated. The Euclidean distance is then measured between each data point and the centroids, and each data point is assigned to the closest centroid.
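To make the description above concrete, here is a minimal sketch of k-means on timestamps flattened to a single numeric column. It assumes scikit-learn is installed (so it falls outside the basic-libraries preference stated in the question), and the seconds values and the choice of two clusters are invented purely for illustration.
import numpy as np
from sklearn.cluster import KMeans

# timestamps flattened to one numeric feature: seconds since the first record
seconds = np.array([[0], [19], [45], [67], [73], [32642], [32662], [32679]])

# k-means needs the number of clusters up front, which is the main mismatch
# with this question: the number of time clusters is not known in advance
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(seconds)
print(labels)  # e.g. [0 0 0 0 0 1 1 1]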
If we split at time differences beyond some limit, then something like
import datetime

# `data` is the sorted list of [name, timestamp-string, value] rows shown above.
# Turn the timestamp strings into datetimes:
date_format = "%Y-%m-%d %H:%M:%S"
for row in data:
    row[1] = datetime.datetime.strptime(row[1], date_format)

# Find the indices where the gap to the previous row is at least 5 minutes,
# then slice the data between consecutive split points.
split_dt = datetime.timedelta(minutes=5)
dts = (d1[1] - d0[1] for d0, d1 in zip(data, data[1:]))
split_at = [i for i, dt in enumerate(dts, 1) if dt >= split_dt]
groups = [data[i:j] for i, j in zip([0] + split_at, split_at + [None])]
might work. (Beware of fencepost errors, though... I make them too easily!)
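For completeness, here is a small usage sketch continuing from the snippet above, including the "drop any group shorter than 3" bonus mentioned in the question; it assumes `data` is the sorted sample list shown there.
groups = [g for g in groups if len(g) >= 3]  # bonus: drop the tiny groups
for g in groups:
    print(len(g), g[0][1], "->", g[-1][1])
# with the sample data this prints three groups, of sizes 5, 7 and 3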
I'm not going to solve your problem, but I'll try to make you feel better about what you already know ;-)
Forget all the details of your problem and think about a list of plain integers instead. Say you want to break it into groups via gaps of at least 5. Here's the list:
[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, ...]
Oops! Every element is in its own group then, and there's simply no way to know that without comparing every adjacent pair of elements. Think about it. So:
My only real idea right now is to loop through the list and compare each time with the next one, splitting the list that way, but it seems like a very inefficient way of solving this problem when there are millions of records to sort and group.
In the example above, that's the best that can be done! It takes time linear in the number of elements, which is rarely considered "very inefficient".
Now in some cases it's certainly possible to do better. Let's change the list above to:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...]
Again with gap 5, there's only one group in total. Can that be discovered using less than a number of compares proportional to the length of the list? Maybe, using variants of binary search, it would be possible to discover that using a number of compares proportional to the logarithm of the length of the list. But details are everything here and they're tricky. So tricky that I dread adapting them to your messier problem.
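Here is a rough sketch of that divide-and-conquer idea, just to show where the trickiness comes in; the function and its name are invented here, the times are assumed to already be plain sorted numbers (e.g. seconds), and it relies on the fact that if the total spread across a span is smaller than the gap, no adjacent pair inside that span can reach it, so the whole span can be skipped.
def find_splits(times, gap, lo=0, hi=None):
    # return the indices i such that times[i] - times[i-1] >= gap
    if hi is None:
        hi = len(times) - 1
    if hi <= lo or times[hi] - times[lo] < gap:
        # the adjacent gaps inside lo..hi sum to times[hi] - times[lo] < gap,
        # so none of them can reach the threshold: skip the whole span
        return []
    if hi - lo == 1:
        return [hi]
    mid = (lo + hi) // 2
    return find_splits(times, gap, lo, mid) + find_splits(times, gap, mid, hi)
Note that the pruning only pays off when whole spans have a spread below the gap; on the dense list above it degenerates to roughly linear work again, which is part of why the details are so tricky.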
And, in the end, unless you have very large groups, I expect it would actually be slower than doing the obvious thing! DSM's answer uses efficient and more-or-less straightforward Python idiom; a complex algorithm that needs to keep track of many little details generally runs slower (even if it has far better theoretical O() behavior) unless applied to very favorable cases.
So be happy with a straightforward loop you understand at a glance :-)