Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Combining vectors of unequal length and non-unique values

I would like to do the following:

combine into a data frame, two vectors that

  • have different length
  • contain sequences found also in the other vector
  • contain sequences not found in the other vector
  • sequences that are not found in other vector are never longer than 3 elements
  • always have same first element

The data frame should show the equal sequences in the two vectors aligned, with NA in the column if a vector lacks a sequence present in the other vector.

For example:

vector 1    vector 2                     vector 1        vector 2
   1           1                            a               a
   2           2                            g               g
   3           3                            b               b
   4           1            or              h               a
   1           2                            a               g
   2           3                            g               b   
   5           4                            c               h
               5                                            c

should be combined into data frame

    1   1                                    a   a
    2   2                                    g   g
    3   3                                    b   b
    4   NA                                   h   NA
    1   1                  or                a   a 
    2   2                                    g   g
    NA  3                                    NA  b
    NA  4                                    NA  h
    5   5                                    c   c

What I did, is to search for merge, combine, cbind, plyr examples but was not able to find solutions. I am afraid I will need to start write a function with nested for loops to solve this problem.

like image 666
Dmitrii I. Avatar asked Jan 15 '23 18:01

Dmitrii I.


1 Answers

I maintain that your problem might be solved in terms of the shortest common supersequence. It assumes that your two vectors each represent one sequence. Please give the code below a try.

If it still does not solve your problem, you'll have to explain exactly what you mean by "my vector contains not one but many sequences": define what you mean by a sequence and tell us how sequences can be identified by scanning through your two vectors.

Part I: given two sequences, find the longest common subsequence

LongestCommonSubsequence <- function(X, Y) {
    m <- length(X)
    n <- length(Y)
    C <- matrix(0, 1 + m, 1 + n)
    for (i in seq_len(m)) {
        for (j in seq_len(n)) {
            if (X[i] == Y[j]) {
                C[i + 1, j + 1] = C[i, j] + 1
            } else {
                C[i + 1, j + 1] = max(C[i + 1, j], C[i, j + 1])
            }
        }
    }

    backtrack <- function(C, X, Y, i, j) {
        if (i == 1 | j == 1) {
            return(data.frame(I = c(), J = c(), LCS = c()))
        } else if (X[i - 1] == Y[j - 1]) {
            return(rbind(backtrack(C, X, Y, i - 1, j - 1),
                         data.frame(LCS = X[i - 1], I = i - 1, J = j - 1)))
        } else if (C[i, j - 1] > C[i - 1, j]) {
            return(backtrack(C, X, Y, i, j - 1))
        } else {
            return(backtrack(C, X, Y, i - 1, j))
        }
    }

    return(backtrack(C, X, Y, m + 1, n + 1))
}

Part II: given two sequences, find the shortest common supersequence

ShortestCommonSupersequence <- function(X, Y) {
    LCS <- LongestCommonSubsequence(X, Y)[c("I", "J")]
    X.df <- data.frame(X = X, I = seq_along(X), stringsAsFactors = FALSE)
    Y.df <- data.frame(Y = Y, J = seq_along(Y), stringsAsFactors = FALSE)   
    ALL <- merge(LCS, X.df, by = "I", all = TRUE)
    ALL <- merge(ALL, Y.df, by = "J", all = TRUE)
    ALL <- ALL[order(pmax(ifelse(is.na(ALL$I), 0, ALL$I),
                          ifelse(is.na(ALL$J), 0, ALL$J))), ]
    ALL$SCS <- ifelse(is.na(ALL$X), ALL$Y, ALL$X)
    ALL
}

Your Example:

ShortestCommonSupersequence(X = c("a","g","b","h","a","g","c"),
                            Y = c("a","g","b","a","g","b","h","c"))
#    J  I    X    Y SCS
# 1  1  1    a    a   a
# 2  2  2    g    g   g
# 3  3  3    b    b   b
# 9 NA  4    h <NA>   h
# 4  4  5    a    a   a
# 5  5  6    g    g   g
# 6  6 NA <NA>    b   b
# 7  7 NA <NA>    h   h
# 8  8  7    c    c   c

(where the two updated vectors are in columns X and Y.)

like image 149
flodel Avatar answered Jan 21 '23 13:01

flodel