Which data clustering algorithm is appropriate to detect an unknown number of clusters in a time series of events?

Here's my scenario. Consider a set of events that happen at various places and times. As an example, consider someone high above a city recording the lightning strikes during a storm. For my purposes, lightning strikes are instantaneous and can only hit certain locations (such as tall buildings). Also imagine each strike has a unique id so one can reference it later. There are about 100,000 such locations in this city (as you may guess, this is an analogy, as my current employer is sensitive about the actual problem).

For phase 1, my input is the set of (strike id, strike time, strike location) tuples. The desired output is the set of clusters of more than 1 event that hit the same location within a short time. The number of clusters is not known in advance (so k-means is not that useful here). What counts as 'short' could be predefined for a given clustering attempt. That is, I can set it to, say, 3 minutes, then run the algorithm; later try with 4 minutes or 10 minutes. Perhaps a nice touch would be for the algorithm to determine a 'strength' of clustering and recommend the value of 'short' that gives the most compact clustering for a given input, but this is not required initially.

For phase 2, I'd like to take into consideration the amplitude of the strike (i.e., a real number) and look for clusters that are both within a short time and with similar amplitudes.

I googled and checked the answers here about data clustering. The information is a bit bewildering (below is the list of links I found useful). AFAIK, k-means and related algorithms would not be useful because they require the number of clusters to be specified a priori. I'm not asking for someone to solve my problem (I like solving it), but some orientation in the large world of data clustering algorithms would save me some time. Specifically, which clustering algorithms are appropriate when the number of clusters is unknown?

Edit: I realized the location is irrelevant, in the sense that although events happen all the time, I only need to cluster them per location. So each location has its own time-series of events that can thus be analyzed independently.

Some technical details:
- as the dataset is not that large, it can all fit in memory.
- parallel processing is a nice-to-have, but not essential. I only have a 4-core machine, and MapReduce and Hadoop would be too much.
- the language I'm most familiar with is Java. I haven't used R yet, and the learning curve would probably be too steep for the time I have been given. I'll have a look at it anyway in my spare time.
- for the time being, using tools to run the analysis is ok; I don't have to produce code only. I'm mentioning this because Weka will probably be suggested.
- visualization would be useful. As the dataset is large enough so it doesn't fit in memory, the visualization should at least support zooming and panning. And to clarify: I don't need to build a visualization GUI; it's just a nice capability to have for checking the results produced with a tool.

Thank you. Questions that I found useful are:
- How to find center of clusters of numbers? statistics problem?
- Clustering Algorithm for Paper Boys
- Java Clustering Library
- How to cluster objects (without coordinates)
- Algorithm for detecting "clusters" of dots

asked Feb 20 '10 06:02

wishihadabettername

2 Answers

I would suggest looking into mean shift clustering. The basic idea is to perform a kernel density estimation on the data and then find the modes of the density estimate. Each data point converges towards one of the modes, and the set of points converging to the same mode defines a cluster.

The nice thing about mean shift clustering is that the number of clusters does not have to be specified ahead of time.
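To make the idea concrete, here is a minimal sketch of mean shift on one-dimensional data such as the strike times at a single location. It assumes a Gaussian kernel, and the class name `MeanShift1D`, the iteration cap, and the rule that two modes closer than half a bandwidth count as the same mode are all illustrative choices of mine, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal 1-D mean shift sketch; all names and tolerances here are illustrative assumptions. */
class MeanShift1D {

    /** Moves a starting point uphill on the Gaussian kernel density estimate until it settles on a mode. */
    static double findMode(double start, double[] data, double bandwidth) {
        double x = start;
        for (int iter = 0; iter < 500; iter++) {      // iteration cap is an arbitrary safety limit
            double weightedSum = 0.0;
            double weightTotal = 0.0;
            for (double d : data) {
                double u = (d - x) / bandwidth;
                double w = Math.exp(-0.5 * u * u);    // Gaussian kernel weight of this data point
                weightedSum += w * d;
                weightTotal += w;
            }
            double next = weightedSum / weightTotal;  // kernel-weighted mean of the data around x
            if (Math.abs(next - x) < 1e-6 * bandwidth) {
                return next;                          // the shift is negligible: treat as converged
            }
            x = next;
        }
        return x;
    }

    /** Returns one cluster label per data point; points that converge to the same mode share a label. */
    static int[] cluster(double[] data, double bandwidth) {
        List<Double> modes = new ArrayList<>();
        int[] labels = new int[data.length];
        for (int i = 0; i < data.length; i++) {
            double mode = findMode(data[i], data, bandwidth);
            int label = -1;
            for (int m = 0; m < modes.size(); m++) {
                // Assumption: modes closer than half a bandwidth are considered identical.
                if (Math.abs(modes.get(m) - mode) < bandwidth / 2) {
                    label = m;
                    break;
                }
            }
            if (label < 0) {                          // a mode not seen before starts a new cluster
                modes.add(mode);
                label = modes.size() - 1;
            }
            labels[i] = label;
        }
        return labels;
    }
}
```

Calling `MeanShift1D.cluster(times, 60.0)` on an array of strike times in seconds returns a label per strike. The bandwidth plays the role of the 'short' interval from the question: the number of clusters falls out of the data, but the bandwidth still has to be chosen. This naive version costs O(n²) per pass, which should be acceptable for the events of one location at a time.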

I have not used Weka, so I am not sure whether it has mean shift clustering. However, if you are using MATLAB, here is a toolbox (KDE toolbox) to do it. Hope that helps.

answered Oct 04 '22 09:10

Aishwar

Couldn't you just use hierarchical clustering with the difference in times of strikes as part of the distance metric?
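As a rough sketch of what that could look like for phase 1: when the only distance is the time difference within one location, single-linkage hierarchical clustering cut at a threshold is equivalent to sorting the events by time and starting a new cluster wherever the gap to the previous event exceeds the threshold. The `Strike` class, the `GapClustering` name, and the use of seconds as the time unit are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: single-linkage clustering of per-location event times, cut at a maximum gap. */
class GapClustering {

    /** Hypothetical event type mirroring the (strike id, strike time, strike location) tuples. */
    static final class Strike {
        final String id;
        final long timeSeconds;
        final String location;

        Strike(String id, long timeSeconds, String location) {
            this.id = id;
            this.timeSeconds = timeSeconds;
            this.location = location;
        }
    }

    /** Returns every cluster of more than one strike at the same location with consecutive gaps <= maxGapSeconds. */
    static List<List<Strike>> cluster(List<Strike> strikes, long maxGapSeconds) {
        // Each location has its own time series, so group the events by location first.
        Map<String, List<Strike>> byLocation = new HashMap<>();
        for (Strike s : strikes) {
            byLocation.computeIfAbsent(s.location, k -> new ArrayList<>()).add(s);
        }

        List<List<Strike>> clusters = new ArrayList<>();
        for (List<Strike> series : byLocation.values()) {
            series.sort(Comparator.comparingLong((Strike s) -> s.timeSeconds));
            List<Strike> current = new ArrayList<>();
            for (Strike s : series) {
                boolean gapTooLarge = !current.isEmpty()
                        && s.timeSeconds - current.get(current.size() - 1).timeSeconds > maxGapSeconds;
                if (gapTooLarge) {
                    if (current.size() > 1) {
                        clusters.add(current);   // keep only clusters with more than one event
                    }
                    current = new ArrayList<>();
                }
                current.add(s);
            }
            if (current.size() > 1) {
                clusters.add(current);           // flush the last run of this location
            }
        }
        return clusters;
    }
}
```

Re-running `GapClustering.cluster(strikes, 180)` and then `GapClustering.cluster(strikes, 240)` corresponds to trying 3 and then 4 minutes for 'short'. One caveat of single linkage is chaining: a long run of events each within the threshold of its neighbour ends up in one cluster, even if the first and last events are far apart. If that is unwanted, complete linkage (bounding the total span of a cluster) would be the variant to look at.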

answered Oct 04 '22 09:10

dsimcha

Tags: language-agnostic, algorithm, cluster-analysis