 

Clustering / Grouping a list based on time (Python)

I have a list of lists that I want to group into separate lists based on clusters of time.

I can easily sort it based on the time, but I have not found an easy way to group it together. I am fine with the time being in datetime format or as text; either one works for me. I need to process the other data based on the cluster. Here is a sample of the dataset I might be working with:

[['asdf', '2012-01-01 00:00:12', '1234'],
 ['asdf', '2012-01-01 00:00:31', '1235'],
 ['asdf', '2012-01-01 00:00:57', '2345'],
 ['asdf', '2012-01-01 00:01:19', '2346'],
 ['asdf', '2012-01-01 00:01:25', '2345'],
 ['asdf', '2012-01-01 09:04:14', '3465'],
 ['asdf', '2012-01-01 09:04:34', '1613'],
 ['asdf', '2012-01-01 09:04:51', '8636'],
 ['asdf', '2012-01-01 09:05:15', '5847'],
 ['asdf', '2012-01-01 09:05:29', '3672'],
 ['asdf', '2012-01-01 09:05:30', '2367'],
 ['asdf', '2012-01-01 09:05:43', '9544'],
 ['asdf', '2012-01-01 14:48:15', '2572'],
 ['asdf', '2012-01-01 14:48:34', '7483'],
 ['asdf', '2012-01-01 14:48:56', '5782']]

The results should look something like this: a nested list of lists, one for each group.

[[['asdf', '2012-01-01 00:00:12', '1234'],
  ['asdf', '2012-01-01 00:00:31', '1235'],
  ['asdf', '2012-01-01 00:00:57', '2345'],
  ['asdf', '2012-01-01 00:01:19', '2346'],
  ['asdf', '2012-01-01 00:01:25', '2345']],
 [['asdf', '2012-01-01 09:04:14', '3465'],
  ['asdf', '2012-01-01 09:04:34', '1613'],
  ['asdf', '2012-01-01 09:04:51', '8636'],
  ['asdf', '2012-01-01 09:05:15', '5847'],
  ['asdf', '2012-01-01 09:05:29', '3672'],
  ['asdf', '2012-01-01 09:05:30', '2367'],
  ['asdf', '2012-01-01 09:05:43', '9544']],
 [['asdf', '2012-01-01 14:48:15', '2572'],
  ['asdf', '2012-01-01 14:48:34', '7483'],
  ['asdf', '2012-01-01 14:48:56', '5782']]]

The clusters have no set size and no set times. They can occur randomly throughout the day, and the grouping needs to be based on large gaps in the time.

The first group happens right after midnight and has 5 entries, the next is centered around 09:05 and has 7 entries, and the final one happens around 14:48 and has only 3 entries. I could also have two groups at either end of an hour, so I cannot simply group by the hour.

I have already sorted and grouped the data by the first field in the list; I just need to break it down into smaller chunks to process. I am willing to change the date to whatever format is necessary to get the grouping done, as it is a key part of the analysis I am doing on the data.

I would prefer to keep the solution within the Python standard library, but if there is no solution there I can try other packages.

I have already looked at solutions here, here, here, here, and many others, but none of them address the random nature of these times.

Splitting the list at any gap greater than some time X would be a great solution; I can set X to 5 or 10 minutes, whatever seems appropriate. Dropping any group with fewer than 3 entries would also be a bonus, but that can easily be done at the end.

My only real idea right now is to loop through the list, compare the current time with the next time, and split the list that way, but that seems like a very inefficient way of solving this problem when there are millions of records to sort and group.
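
Here is a rough sketch of the kind of loop I mean (the names and the 5-minute threshold are just examples, and it assumes the list is already sorted by timestamp):

import datetime

def split_on_gaps(rows, max_gap=datetime.timedelta(minutes=5)):
    # Split time-sorted rows into groups wherever the gap between
    # consecutive timestamps is at least max_gap.
    fmt = "%Y-%m-%d %H:%M:%S"
    groups = []
    prev = None
    for row in rows:
        ts = datetime.datetime.strptime(row[1], fmt)
        if prev is None or ts - prev >= max_gap:
            groups.append([])  # large gap (or first row): start a new group
        groups[-1].append(row)
        prev = ts
    return groups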

Any help would be greatly appreciated. If any of this doesn't make sense I will do my best to clarify.

— asked by eseglem, Nov 21 '13




2 Answers

If we split at time differences beyond some limit, then something like

import datetime

# turn the timestamp strings into datetime objects
date_format = "%Y-%m-%d %H:%M:%S"
for row in data:
    row[1] = datetime.datetime.strptime(row[1], date_format)

# indices at which the gap to the previous row is at least 5 minutes
split_dt = datetime.timedelta(minutes=5)
dts = (d1[1] - d0[1] for d0, d1 in zip(data, data[1:]))
split_at = [i for i, dt in enumerate(dts, 1) if dt >= split_dt]

# slice the data at those indices
groups = [data[i:j] for i, j in zip([0] + split_at, split_at + [None])]

might work. (Beware of fencepost errors, though... I make them too easily!)
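
On the sample data above this produces the three expected groups. The "drop groups with fewer than 3 entries" step from the question is then a one-liner on top (a possible follow-up, not part of the snippet above):

# keep only groups with at least 3 entries
groups = [g for g in groups if len(g) >= 3]

for g in groups:
    print(len(g), g[0][1], "->", g[-1][1])  # size and time span of each group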

— answered by DSM


I'm not going to solve your problem, but I'll try to make you feel better about what you already know ;-)

Forget all the details of your problem and think about a list of plain integers instead. Say you want to break it into groups via gaps of at least 5. Here's the list:

[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, ...]

Oops! Every element is in its own group then, and there's simply no way to know that without comparing every adjacent pair of elements. Think about it. So:

My only real idea right now is to loop through the list, compare the current time with the next time, and split the list that way, but that seems like a very inefficient way of solving this problem when there are millions of records to sort and group.

In the example above, that's the best that can be done! It takes time linear in the number of elements, which is rarely considered "very inefficient".
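
For concreteness, here is what that best-possible linear pass looks like on plain integers (a minimal sketch; the function name is made up for illustration):

def groups_by_gap(xs, gap=5):
    # One linear pass over a sorted list of ints, starting a new
    # group at every adjacent difference >= gap.
    groups = [[xs[0]]] if xs else []
    for prev, cur in zip(xs, xs[1:]):
        if cur - prev >= gap:
            groups.append([])
        groups[-1].append(cur)
    return groups

print(groups_by_gap([10, 20, 30, 40, 50]))
# [[10], [20], [30], [40], [50]] -- every element in its own group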

Now in some cases it's certainly possible to do better. Let's change the list above to:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...]

Again with gap 5, there's only one group in total. Can that be discovered using less than a number of compares proportional to the length of the list? Maybe, using variants of binary search, it would be possible to discover that using a number of compares proportional to the logarithm of the length of the list. But details are everything here and they're tricky. So tricky that I dread adapting them to your messier problem.
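
To give a flavor of the idea (a rough sketch only, with all the caveats above): because the list is sorted, adjacent differences are non-negative, so whenever xs[j] - xs[i] < gap the whole stretch xs[i:j+1] is provably split-free and can be skipped without inspecting it. A galloping scan exploits that:

def group_spans(xs, gap=5):
    # Yield (start, stop) index spans of gap-separated groups in a
    # sorted list of ints. Doubles the jump size while the jumped-over
    # stretch is provably split-free; the worst case is still linear,
    # but data much denser than the gap gets skipped quickly.
    n = len(xs)
    start = i = 0
    while i < n - 1:
        step = 1
        while i + step < n and xs[i + step] - xs[i] < gap:
            i += step
            step *= 2
        if i + 1 < n and xs[i + 1] - xs[i] >= gap:
            yield (start, i + 1)  # split between i and i+1
            start = i + 1
        i += 1
    if n:
        yield (start, n)

print(list(group_spans([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])))
# [(0, 12)] -- one group, found with fewer compares than elements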

And, in the end, unless you have very large groups, I expect it would actually be slower than doing an obvious thing! DSM's answer uses efficient and more-or-less straightforward Python idiom; a complex algorithm that needs to keep track of many little details generally runs slower (even if it has far better theoretical O() behavior) unless applied to very favorable cases.

So be happy with a straightforward loop you understand at a glance :-)

— answered by Tim Peters