I have a number of time series records that overlap at some times and don't necessarily have the same start and end dates. Each row represents a different time series. I made them all the same length, filling the gaps with NaN, so that each position still corresponds to the actual time of data collection.
For example, at times t = 1, 2, 3, 4, 5, 6:
Station 1: nan, nan, 2, 4, 5, 10
Station 2: nan, 1, 4, nan, 10, 8
Station 3: 1, 9, 4, 7, nan, nan
I am trying to run a cluster analysis in Python to group the stations with similar behavior. The timing of the behavior is important, so (as far as I know) I can't just get rid of the NaNs.
Any ideas?
K-means is not the best algorithm for this kind of data.
K-means is designed to minimize within-cluster variance, i.e. the within-cluster sum of squares (WCSS).
But how do you compute variance with NaNs? And how meaningful is variance here anyway?
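To make the problem concrete, here is a minimal sketch using the three stations from the question. It shows that a plain within-cluster sum of squares becomes NaN as soon as any series has a gap. The `overlap_distance` helper is a hypothetical illustration, not something from the original answer: it assumes it is acceptable to compare two stations only at the times where both have data. Distances computed this way rest on different numbers of points per pair and may not behave like a true metric, so treat them with care.

```python
import numpy as np

# Example data from the question: rows are stations, columns are times t = 1..6.
X = np.array([
    [np.nan, np.nan, 2, 4, 5, 10],   # Station 1
    [np.nan, 1, 4, np.nan, 10, 8],   # Station 2
    [1, 9, 4, 7, np.nan, np.nan],    # Station 3
])

def wcss(points):
    """Within-cluster sum of squares around the plain (NaN-unaware) mean."""
    centroid = points.mean(axis=0)
    return ((points - centroid) ** 2).sum()

def overlap_distance(a, b):
    """Hypothetical helper: RMS difference over times where both series have data."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if not both.any():
        return np.nan  # no shared timestamps, so no distance can be computed
    return np.sqrt(np.mean((a[both] - b[both]) ** 2))

# Treating Stations 1 and 2 as one cluster: the NaNs propagate into the result.
print(wcss(X[:2]))

# Pairwise comparison restricted to shared timestamps stays finite.
print(overlap_distance(X[0], X[1]))
```

A matrix of such pairwise values could in principle be passed to a clustering method that accepts precomputed distances, but whether that suits this data is a judgment call.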
Instead, you may want to use