
Finding Sequences [gap or difference] between two vectors

Tags: r, sequences

Suppose I have two vectors:

a <- c(1,3,5,7,9, 23,35,36,43)
b <- c(2,4,6,8,10,24, 37, 45)

Note that the two vectors have different lengths.

I want to align the two vectors by closest proximity, leaving a gap (NA) wherever an element of a has no close match in b.

Expected Output

a     b
1     2
3     4
5     6
7     8
9     10
23    24
35    NA
36    37
43    45

Note that 35 is paired with NA because 36 is the closer match for 37.

BrownNation asked Apr 10 '18 17:04




2 Answers

You can use findInterval:

df <- data.frame(a)
df$b[findInterval(b, a)] <- b
df
   a  b
1  1  2
2  3  4
3  5  6
4  7  8
5  9 10
6 23 24
7 35 NA
8 36 37
9 43 45
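
To see why this works, it may help to look at the index vector on its own. The following is a minimal sketch using the vectors from the question; the intermediate name idx is introduced here for illustration only. findInterval(b, a) returns, for each element of b, the position of the largest element of a that is less than or equal to it.

```r
a <- c(1, 3, 5, 7, 9, 23, 35, 36, 43)
b <- c(2, 4, 6, 8, 10, 24, 37, 45)

# For each element of b: index of the largest element of a that is <= it
idx <- findInterval(b, a)
idx
# 1 2 3 4 5 6 8 9

# Start from an all-NA column, then fill only the matched rows.
# Row 7 (a = 35) is never selected, so it stays NA.
df <- data.frame(a = a, b = NA_real_)
df$b[idx] <- b
df
```

This relies on a being sorted in non-decreasing order and on every value of b being at least as large as the value of a it should pair with. Inputs that break those assumptions would likely need a different matching rule.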
BENY answered Oct 11 '22 09:10


This algorithm can only handle a single NA. For N NAs, you would have to try every combination of N insertion slots. It tries each possible insertion slot for the NA and keeps the one that minimizes mean(abs(a - b)).

  library(magrittr)  # provides the %>% pipe

  # Try inserting NA after each element of b and score the alignment
  sapply(seq_along(b),
         function(i) mean(abs(append(b, NA, i) - a), na.rm = TRUE)) %>%
  # Find index of the best insertion spot
  which.min %>%
  # Actually insert
  {append(b, NA, .)} %>%
  # Display data
  {cbind(a, b = .)}

       a  b
 [1,]  1  2
 [2,]  3  4
 [3,]  5  6
 [4,]  7  8
 [5,]  9 10
 [6,] 23 24
 [7,] 35 NA
 [8,] 36 37
 [9,] 43 45
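
The N-gap case mentioned above could be sketched as a brute-force search over every choice of NA rows. This is an illustrative sketch, not part of the original answer: the helper name align_with_gaps is hypothetical, it assumes a is the longer vector, and it reuses the same mean absolute difference as the score. The number of candidates grows combinatorially, so it is only practical for short vectors and a few gaps.

```r
# Hypothetical helper: pad b with NAs so that it lines up with a
align_with_gaps <- function(a, b) {
  n <- length(a)
  n_gaps <- n - length(b)
  stopifnot(n_gaps >= 0)
  if (n_gaps == 0) return(cbind(a, b = b))

  # Place b into every row except the ones listed in `gaps`
  pad <- function(gaps) {
    padded <- rep(NA_real_, n)
    padded[setdiff(seq_len(n), gaps)] <- b
    padded
  }

  # Every way of choosing which rows of the result hold NA
  candidates <- combn(seq_len(n), n_gaps, simplify = FALSE)

  # Score each candidate by mean absolute difference, ignoring the NA rows
  cost <- vapply(candidates,
                 function(gaps) mean(abs(pad(gaps) - a), na.rm = TRUE),
                 numeric(1))

  cbind(a, b = pad(candidates[[which.min(cost)]]))
}

a <- c(1, 3, 5, 7, 9, 23, 35, 36, 43)
b <- c(2, 4, 6, 8, 10, 24, 37, 45)
res <- align_with_gaps(a, b)
res
```

For the vectors in the question there is one gap, so this reduces to the single-insertion search and puts the NA in row 7, next to 35.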
Vlo answered Oct 11 '22 09:10