How do I use k-means on time series data that has NaNs?

Question

I have a number of time series records that overlap at some times but don't necessarily share the same start and end dates. Each row represents a different time series. I made them all the same length, with NaN wherever a station has no data, so that each column still corresponds to the actual time of data collection.

For example, at times t = 1, 2, 3, 4, 5, 6:

Station 1: nan, nan, 2, 4, 5, 10

Station 2: nan, 1, 4, nan, 10, 8

Station 3: 1, 9, 4, 7, nan, nan

I am trying to run a cluster analysis in Python to group stations with similar behavior. The timing of the behavior is important, so as far as I know I can't just drop the NaNs.

Any ideas?

Has QUIT--Anony-Mousse · Accepted Answer

K-means is not the best algorithm for this kind of data.

K-means is designed to minimize within-cluster variance (= sum of squares, WCSS).

But how do you compute variance with NaNs? And how meaningful is variance here anyway?
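To make the problem concrete, here is a minimal sketch using the three stations from the question. It treats all three as one cluster (an assumption for illustration only) and shows that a plain mean and sum of squares both turn into NaN as soon as any member is missing a sample:

```python
import numpy as np

# The three stations from the question; NaN marks a missing sample.
stations = np.array([
    [np.nan, np.nan, 2, 4, 5, 10],   # Station 1
    [np.nan, 1, 4, np.nan, 10, 8],   # Station 2
    [1, 9, 4, 7, np.nan, np.nan],    # Station 3
])

# A k-means centroid is the per-timestep mean of the cluster members.
# Any timestep where at least one member is NaN yields a NaN mean.
centroid = stations.mean(axis=0)
print(centroid)

# The within-cluster sum of squares (WCSS) inherits those NaNs,
# so the quantity k-means tries to minimize is undefined.
wcss = ((stations - centroid) ** 2).sum()
print(wcss)
```

Only t = 3 has data from all three stations, so that is the only timestep with a defined centroid value.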

Instead, you may want to use:

- a similarity measure designed for time series, such as DTW (dynamic time warping) or threshold-crossing distances;
- a distance-based clustering algorithm. If you only have a few series, hierarchical clustering should be fine.

Tags:

python

numpy

cluster-analysis

time-series

user2748977

