
How do I use k-means on time series data that has nans?

I have a number of time series records that overlap at some times and don't necessarily have the same start and end dates. Each row represents a different time series. I padded them with nans so they all have the same length and each column keeps the actual time of data collection.

For example, at t(1,2,3,4,5,6):

Station 1: nan, nan, 2, 4, 5, 10

Station 2: nan, 1, 4, nan, 10, 8

Station 3: 1, 9, 4, 7, nan, nan

I am trying to run a cluster analysis in Python to group the stations with similar behavior. The timing of the behavior is important, so, as far as I know, I can't just get rid of the nans.

Any ideas?

user2748977 asked Nov 01 '22 15:11



1 Answer

K-means is not the best algorithm for this kind of data.

K-means is designed to minimize within-cluster variance (i.e. the within-cluster sum of squares, WCSS).

But how do you compute variance with NaNs? And how meaningful is variance here anyway?
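To make the problem concrete, here is a small sketch using the three stations from the question. It only shows how NaNs propagate through the quantities k-means relies on (means and variances). The NaN-aware variant at the end is an illustration of what "ignoring" missing values would mean, not a recommendation.

```python
import numpy as np

# Stations from the question; rows are stations, columns are t = 1..6.
X = np.array([
    [np.nan, np.nan, 2, 4, 5, 10],   # Station 1
    [np.nan, 1, 4, np.nan, 10, 8],   # Station 2
    [1, 9, 4, 7, np.nan, np.nan],    # Station 3
])

# A k-means centroid is the per-timestamp mean of the cluster members.
# A single NaN in a column makes that coordinate of the centroid NaN.
centroid = X.mean(axis=0)
n_undefined = int(np.isnan(centroid).sum())

# The same happens with the variance of a single series.
plain_var = np.var(X[0])

# NaN-aware functions skip the missing values, but each statistic is then
# computed over a different set of timestamps, which changes what it means.
nan_aware_var = np.nanvar(X[0])
```

With this data only t = 3 is observed at all three stations, so every other coordinate of the centroid is undefined.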

Instead, you may want to use

  • a similarity measure designed for time series, such as dynamic time warping (DTW) or threshold-crossing distances.
  • a distance-based clustering algorithm. If you only have a few series, hierarchical clustering should be fine.
Has QUIT--Anony-Mousse answered Nov 09 '22 09:11
