
How to squeeze in missing values into a vector

Tags:

r

Let me try to make this question as general as possible.

Let's say I have two variables a and b.

a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])

So b has 17 observations and is a subset of a which has 20 observations.

My question is the following: how would I use these two variables to generate a third variable c which, like a, has 20 observations, but in which observations 7, 11 and 15 are missing and the other observations are identical to b, in the order of a?

Or to put it somewhat differently: how could I squeeze these missing observations into variable b at locations 7, 11 and 15?

It seems pretty straightforward (and it probably is), but I have been unable to get this to work for a bit too long now.

hjms asked May 01 '14 21:05



3 Answers

1) loop. Try this loop:

# test data
set.seed(123) # for reproducibility
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])

# let's work with vectors
A <- a[[1]]
B <- b[[1]]

j <- 1
C <- A
for(i in seq_along(A)) if (A[i] == B[j]) j <- j+1 else C[i] <- NA

which gives:

> C
 [1]  2  7  4  8  9  0 NA  8  5  4 NA  4  6  5 NA  8  2  0  3  9
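The loop above keeps indexing B[j] after every element of B has been matched, so it stops with an error ("missing value where TRUE/FALSE needed") if the removed observations include the tail of A. Below is a minimal sketch of the same greedy idea wrapped in a function with a bounds check added. The name fill_missing_greedy is hypothetical, and the sketch assumes B is A with some elements removed and the order preserved.

```r
# Hypothetical helper: greedy left-to-right matching of B against A.
# Assumes B is A with some elements removed and the order preserved.
fill_missing_greedy <- function(A, B) {
  C <- A
  j <- 1
  for (i in seq_along(A)) {
    # Only compare while unmatched elements of B remain.
    if (j <= length(B) && A[i] == B[j]) {
      j <- j + 1
    } else {
      C[i] <- NA
    }
  }
  C
}

A <- c(3L, 1L, 4L, 1L, 5L)
B <- A[-c(2, 5)]              # drop positions 2 and 5 (5 is the tail)
result <- fill_missing_greedy(A, B)
result
# [1]  3 NA  4  1 NA
```

Because the matching is greedy, a removed value that equals its right-hand neighbour is reported one position late; the values of C are still correct, only the NA moves.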

2) Reduce. Here is a loop-free version:

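# j is the index of the next unmatched element of B, so it must start at 1;
# without an explicit init, Reduce would use A[1] as the starting index
f <- function(j, a) j + (a == B[j])
r <- Reduce(f, A, 1, accumulate = TRUE)
# r[i + 1] is j after processing A[i]; a repeated value means A[i] was skipped
ifelse(duplicated(r)[-1], NA, A)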

giving:

[1]  2  7  4  8  9  0 NA  8  5  4 NA  4  6  5 NA  8  2  0  3  9

3) dtw. Using dtw in the package of the same name we can get a compact loop-free one-liner:

library(dtw)

ifelse(duplicated(dtw(A, B)$index2), NA, A)

giving:

[1]  2  7  4  8  9  0 NA  8  5  4 NA  4  6  5 NA  8  2  0  3  9
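The approaches above infer the missing positions by comparing A with B. If the positions are already known, as they are in the question (7, 11 and 15), A is not needed at all. Here is a minimal sketch under that assumption; missing_idx is a name introduced for illustration.

```r
set.seed(123)  # same test data as above
A <- as.integer(runif(20, min = 0, max = 10))
missing_idx <- c(7, 11, 15)   # assumed to be known in advance
B <- A[-missing_idx]

# Start from an all-NA vector of the full length and fill in
# every position except the known gaps with the values of B.
C <- rep(NA_integer_, length(B) + length(missing_idx))
C[-missing_idx] <- B
C
```

Negative indexing on the left-hand side assigns B, in order, to all positions other than those in missing_idx.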

REVISED: Added additional solutions.

G. Grothendieck answered Sep 22 '22 12:09



Here's a more complicated way of doing it, using the Levenshtein distance algorithm, that does a better job on more complicated examples (it also seemed faster in a couple of larger tests I tried):

# using same data as G. Grothendieck:
set.seed(123) # for reproducibility
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
A = a[[1]]
B = b[[1]]

# compute the transformation between the two, assigning infinite weight to 
# insertion and substitution
# using +1 here because the integers fed to intToUtf8 have to be larger than 0
# could also adjust the range more dynamically based on A and B
transf = attr(adist(intToUtf8(A+1), intToUtf8(B+1),
                    costs = c(Inf,1,Inf), counts = TRUE), 'trafos')

C = A
C[substring(transf, 1:nchar(transf), 1:nchar(transf)) == "D"] <- NA
#[1]  2  7  4  8  9  0 NA  8  5  4 NA  4  6  5 NA  8  2  0  3  9
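The 'trafos' attribute is an edit transcript with one character per step ("M" match, "I" insertion, "D" deletion, "S" substitution). With insertions and substitutions priced out, it should contain only "M" and "D", one character per element of A. Below is a small sketch of the decoding step on its own, using strsplit in place of the substring call. The transcript in transf_demo is made up for illustration and is not output from adist.

```r
A_demo <- c(5, 6, 7, 8)
transf_demo <- "MMDM"                    # made-up transcript, not from adist

# Split the transcript into single characters, one per element of A_demo.
ops <- strsplit(transf_demo, "")[[1]]    # "M" "M" "D" "M"

C_demo <- A_demo
C_demo[ops == "D"] <- NA                 # deleted positions become NA
C_demo
# [1]  5  6 NA  8
```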

More complex matching example (where the greedy algorithm would perform poorly):

A = c(1,1,2,2,1,1,1,2,2,2)
B = c(1,1,1,2,2,2)

transf = attr(adist(intToUtf8(A), intToUtf8(B),
                    costs = c(Inf,1,Inf), counts = TRUE), 'trafos')

C = A
C[substring(transf, 1:nchar(transf), 1:nchar(transf)) == "D"] <- NA
#[1] NA NA NA NA  1  1  1  2  2  2

# the greedy algorithm would return this instead:
#[1]  1  1 NA NA  1 NA NA  2  2  2
eddi answered Sep 20 '22 12:09



Here is the data frame version, which isn't terribly different from G. Grothendieck's loop above (it assumes a and b are set up as above).

j <- 1
c <- a
for (i in seq_along(a[, 1])) {
    if (a[i, 1] == b[j, 1]) {
        j <- j + 1
    } else {
        c[i, 1] <- NA
    }
}
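Whichever approach is used, the result can be checked against the inputs. Below is a rough sketch of such a check for vector inputs. The function name check_fill is made up here, and the check assumes A itself contains no NA values.

```r
# Hypothetical sanity check: TRUE if C looks like A with exactly the
# values missing from B replaced by NA. Assumes A has no NAs of its own.
check_fill <- function(C, A, B) {
  kept <- !is.na(C)
  length(C) == length(A) &&
    sum(!kept) == length(A) - length(B) &&   # as many NAs as removed values
    all(C[kept] == A[kept]) &&               # kept values sit where A has them
    all(C[kept] == B)                        # and reproduce B in order
}

A <- c(2, 7, 4, 8, 9)
B <- A[-3]
ok  <- check_fill(c(2, 7, NA, 8, 9), A, B)   # TRUE
bad <- check_fill(c(2, NA, 4, 8, 9), A, B)   # FALSE: the wrong position is NA
```

For the data frames in this answer, the call would presumably be check_fill(c[, 1], a[, 1], b[, 1]).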
Joe answered Sep 21 '22 12:09
