Combining vectors of unequal length and non-unique values

Question

I would like to do the following:

combine into a data frame, two vectors that

have different length
contain sequences found also in the other vector
contain sequences not found in the other vector
sequences that are not found in other vector are never longer than 3 elements
always have same first element

The data frame should show the equal sequences in the two vectors aligned, with NA in the column if a vector lacks a sequence present in the other vector.

For example:

vector 1    vector 2                     vector 1        vector 2
   1           1                            a               a
   2           2                            g               g
   3           3                            b               b
   4           1            or              h               a
   1           2                            a               g
   2           3                            g               b   
   5           4                            c               h
               5                                            c

should be combined into data frame

    1   1                                    a   a
    2   2                                    g   g
    3   3                                    b   b
    4   NA                                   h   NA
    1   1                  or                a   a 
    2   2                                    g   g
    NA  3                                    NA  b
    NA  4                                    NA  h
    5   5                                    c   c

What I did, is to search for merge, combine, cbind, plyr examples but was not able to find solutions. I am afraid I will need to start write a function with nested for loops to solve this problem.

flodel · Accepted Answer

I maintain that your problem might be solved in terms of the shortest common supersequence. It assumes that your two vectors each represent one sequence. Please give the code below a try.

If it still does not solve your problem, you'll have to explain exactly what you mean by "my vector contains not one but many sequences": define what you mean by a sequence and tell us how sequences can be identified by scanning through your two vectors.

Part I: given two sequences, find the longest common subsequence

LongestCommonSubsequence <- function(X, Y) {
    m <- length(X)
    n <- length(Y)
    C <- matrix(0, 1 + m, 1 + n)
    for (i in seq_len(m)) {
        for (j in seq_len(n)) {
            if (X[i] == Y[j]) {
                C[i + 1, j + 1] = C[i, j] + 1
            } else {
                C[i + 1, j + 1] = max(C[i + 1, j], C[i, j + 1])
            }
        }
    }

    backtrack <- function(C, X, Y, i, j) {
        if (i == 1 | j == 1) {
            return(data.frame(I = c(), J = c(), LCS = c()))
        } else if (X[i - 1] == Y[j - 1]) {
            return(rbind(backtrack(C, X, Y, i - 1, j - 1),
                         data.frame(LCS = X[i - 1], I = i - 1, J = j - 1)))
        } else if (C[i, j - 1] > C[i - 1, j]) {
            return(backtrack(C, X, Y, i, j - 1))
        } else {
            return(backtrack(C, X, Y, i - 1, j))
        }
    }

    return(backtrack(C, X, Y, m + 1, n + 1))
}

Part II: given two sequences, find the shortest common supersequence

ShortestCommonSupersequence <- function(X, Y) {
    LCS <- LongestCommonSubsequence(X, Y)[c("I", "J")]
    X.df <- data.frame(X = X, I = seq_along(X), stringsAsFactors = FALSE)
    Y.df <- data.frame(Y = Y, J = seq_along(Y), stringsAsFactors = FALSE)   
    ALL <- merge(LCS, X.df, by = "I", all = TRUE)
    ALL <- merge(ALL, Y.df, by = "J", all = TRUE)
    ALL <- ALL[order(pmax(ifelse(is.na(ALL$I), 0, ALL$I),
                          ifelse(is.na(ALL$J), 0, ALL$J))), ]
    ALL$SCS <- ifelse(is.na(ALL$X), ALL$Y, ALL$X)
    ALL
}

Your Example:

ShortestCommonSupersequence(X = c("a","g","b","h","a","g","c"),
                            Y = c("a","g","b","a","g","b","h","c"))
#    J  I    X    Y SCS
# 1  1  1    a    a   a
# 2  2  2    g    g   g
# 3  3  3    b    b   b
# 9 NA  4    h <NA>   h
# 4  4  5    a    a   a
# 5  5  6    g    g   g
# 6  6 NA <NA>    b   b
# 7  7 NA <NA>    h   h
# 8  8  7    c    c   c

(where the two updated vectors are in columns X and Y.)

Combining vectors of unequal length and non-unique values

Tags:

dataframe

r

missing-data

sequence

Dmitrii I.

1 Answers

flodel

Recent Activity

Donate For Us

Combining vectors of unequal length and non-unique values

Tags:

dataframe

r

missing-data

sequence

Dmitrii I.

1 Answers

flodel

Related questions

Recent Activity

Donate For Us