 

Aligning sequences with missing values

The language I'm using is R, but you don't necessarily need to know about R to answer the question.

Question: I have a sequence that can be considered the ground truth, and another sequence that is a shifted version of the first, with some missing values. I'd like to know how to align the two.

Setup

I have a sequence ground.truth that is basically a set of times:

ground.truth <- rep( seq(1,by=4,length.out=10), 5 ) +
                rep( seq(0,length.out=5,by=4*10+30), each=10 )

Think of ground.truth as times where I'm doing the following:

{take a sample every 4 seconds for 10 times, then wait 30 seconds} x 5
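As a sanity check on that description, here is a minimal sketch that builds the same schedule burst by burst and compares it with the vectorised one-liner. The names burst_offsets, burst_starts and explicit are hypothetical helpers introduced only for this illustration.

```r
# Offsets of the 10 samples inside one burst: 1, 5, ..., 37
burst_offsets <- seq(1, by = 4, length.out = 10)

# Start of each of the 5 bursts, spaced 4*10 + 30 = 70 seconds apart
burst_starts <- seq(0, by = 4 * 10 + 30, length.out = 5)

# Build the schedule one burst at a time
explicit <- unlist(lapply(burst_starts, function(s) s + burst_offsets))

# The vectorised form from the question, for comparison
ground.truth <- rep(seq(1, by = 4, length.out = 10), 5) +
                rep(seq(0, length.out = 5, by = 4 * 10 + 30), each = 10)

stopifnot(isTRUE(all.equal(explicit, ground.truth)))
```

Both constructions give 50 time points, with each burst starting 70 seconds after the previous one.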

I have a second sequence observations, which is ground.truth shifted with 20% of the values missing:

nSamples <- length(ground.truth)
idx_to_keep <- sort(sample( 1:nSamples, .8*nSamples ))
theLag <- runif(1)*100
observations <- ground.truth[idx_to_keep] + theLag
nObs     <- length(observations)

If I plot these vectors this is what it looks like (remember, think of these as times):

[Image: plot of ground.truth and observations as times]

What I've tried. I want to:

  • calculate the shift (theLag in my example above)
  • calculate a vector idx such that ground.truth[idx] == observations - theLag

First, assume we know theLag. Note that ground.truth[1] is not necessarily observations[1]-theLag. In fact, we have ground.truth[1] == observations[1+lagI]-theLag for some lagI.

To calculate this, I thought I'd use cross-correlation (ccf function).

However, whenever I do this, the maximum cross-correlation occurs at a lag of 0, which would mean ground.truth[1] == observations[1] - theLag. I get this even in examples where I've explicitly made sure that observations[1] - theLag is not ground.truth[1] (by modifying idx_to_keep so that it doesn't contain 1).

The shift theLag shouldn't affect the cross-correlation (isn't ccf(x,y) == ccf(x,y-constant)?) so I was going to work it out later.
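The invariance claim can be checked directly. The sketch below uses two hypothetical toy series, x and y, not the data from the question, and compares ccf before and after subtracting a constant. Note that ccf pairs elements by their position in the vector, treating both inputs as regularly sampled series, which may be why it does not suit vectors of event times with missing entries.

```r
set.seed(1)               # arbitrary seed, only for reproducibility
x <- cumsum(rnorm(50))    # hypothetical toy series
y <- cumsum(rnorm(50))    # hypothetical toy series

cc_raw     <- ccf(x, y,       plot = FALSE)
cc_shifted <- ccf(x, y - 100, plot = FALSE)

# The series are demeaned before correlating, so a constant shift
# should change nothing beyond floating-point rounding
same <- isTRUE(all.equal(cc_raw$acf, cc_shifted$acf))
print(same)
```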

Perhaps I'm misunderstanding though, because observations doesn't have as many values in it as ground.truth? Even in the simpler case where I set theLag==0, the cross correlation function still fails to identify the correct lag, which leads me to believe I'm thinking about this wrong.

Does anyone have a general methodology for me to go about this, or know of some R functions/packages that could help?

Thanks a lot.

asked Jan 17 '23 21:01 by mathematical.coffee


1 Answer

For the lag, you can compute all the differences (distances) between your two sets of points:

diffs <- outer(observations, ground.truth, '-')

Every observation equals some ground.truth value plus the lag, so the lag should be the one difference that appears length(observations) times:

which(table(diffs) == length(observations))
# 55.715382960625 
#              86 

Double check:

theLag
# [1] 55.71538
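One caveat: differences computed in floating point are not guaranteed to be bit-identical, so counting exact matches can be fragile. The sketch below is a possible variant, not part of the original answer. It rounds the differences before tabulating and takes the most frequent one. The fixed lag, the fixed set of dropped indices and the 6-decimal rounding are assumptions chosen to make the example deterministic.

```r
# Data as in the question, but with a fixed lag and a fixed set of
# dropped samples (assumptions, so that the result is reproducible)
ground.truth <- rep(seq(1, by = 4, length.out = 10), 5) +
                rep(seq(0, length.out = 5, by = 4 * 10 + 30), each = 10)
nSamples     <- length(ground.truth)
idx_to_keep  <- setdiff(seq_len(nSamples), seq(3, nSamples, by = 5))  # drop 20%
theLag       <- 55.715382960625
observations <- ground.truth[idx_to_keep] + theLag

# All pairwise differences, rounded so that values differing only by
# floating-point noise fall into the same bin (6 decimals is an assumption)
diffs  <- round(outer(observations, ground.truth, "-"), 6)
counts <- table(diffs)

# Take the most frequent difference instead of requiring an exact count
estLag <- as.numeric(names(counts)[which.max(counts)])

stopifnot(abs(estLag - theLag) < 1e-5)
```

If the missing samples happen to remove the first or last sample of every burst, a lag that is off by one sampling interval can tie with the true one, so it may be worth checking how many differences reach the maximum count.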

The second part of your question is easy once you have found theLag:

idx <- which(ground.truth %in% (observations - theLag))
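Because %in% tests for exact equality, and (x + lag) - lag does not always reproduce x exactly in floating point, a tolerance-based match may be safer. The following is a minimal sketch under the same illustrative assumptions: a fixed lag, a fixed set of dropped indices, and rounding to 6 decimals.

```r
# Illustrative data: fixed lag and fixed dropped indices (assumptions)
ground.truth <- rep(seq(1, by = 4, length.out = 10), 5) +
                rep(seq(0, length.out = 5, by = 4 * 10 + 30), each = 10)
idx_to_keep  <- setdiff(seq_along(ground.truth),
                        seq(3, length(ground.truth), by = 5))
theLag       <- 55.715382960625
observations <- ground.truth[idx_to_keep] + theLag

# Round both sides before matching, so that tiny rounding errors
# do not break the equality test
idx <- match(round(observations - theLag, 6), round(ground.truth, 6))

# Every observation should map back to the sample it came from
stopifnot(!anyNA(idx), all(idx == idx_to_keep))
```

Here match returns, for each observation, the position of the corresponding value in ground.truth, so ground.truth[idx] lines up element by element with observations - theLag.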
answered Jan 26 '23 04:01 by flodel