Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

match two columns with two other columns

Tags:

r

I have several rows of data (tab separated). I want to find the row which matches elements from two columns (3rd & 4th) in each row with two other columns (10th & 11th). For example, in row 1, 95428891 & 95443771 in column 3 & 4 matches elements in columns 10 & 11 in row 19. Similarly the reciprocal is also true. Elements in columns 3 & 4 in the 19th row also match elements in columns 10 & 11 in row 1. I need to be able to go through each row and output row indices for corresponding matches. It is possible that sometimes only one of the columns match instead of both (because sometimes there are duplicate numbers), but I need to pick only rows where both columns match and also where there is reciprocal match. So it would be a good idea to output row indices where there is reciprocal match, for eg., 1 & 19 as tab separated values (maybe in a different data.frame object). And the rows that do not have reciprocal matches can be output separately. I am trying to implement this in R to run through several rows of data.

1313    chr2    95428891    95443771    14880   chr2:96036782   205673  +   chr2    96036782    96052481
1313    chr2    95428896    95443771    14875   chr2:97111880   205214  -   chr2    97111880    97127588
1313    chr2    95443771    95526464    82693   chr2:95609272   1748861 -   chr2    95609272    95691902
1313    chr2    95477143    95486318    9175    chr2:97616847   177391  +   chr2    97616847    97626039
1313    chr2    95486323    95521267    34944   chr2:97035158   268351  +   chr2    97035158    97070183
1313    chr2    95515418    95525958    10540   chr2:95563236   132439  +   chr2    95563236    95572666
1314    chr2    95563236    95572666    9430    chr2:95515418   132439  +   chr2    95515418    95525958
1314    chr2    95563236    95572666    9430    chr2:95609778   126017  -   chr2    95609778    95620287
1314    chr2    95563236    95569115    5879    chr2:97064308   89848   +   chr2    97064308    97070183
164     chr2    95609272    95691902    82630   chr2:95443771   1748861 -   chr2    95443771    95526464
1314    chr2    95609778    95620287    10509   chr2:95563236   126017  -   chr2    95563236    95572666
1314    chr2    95614473    95649363    34890   chr2:97035158   394821  -   chr2    97035158    97070173
1314    chr2    95649368    95658543    9175    chr2:97616847   177822  -   chr2    97616847    97626039
164     chr2    95775062    95814080    39018   chr2:97578938   0       -   chr2    97578938    97616780
1315    chr2    95778788    95781856    3068    chr2:97609982   31302   -   chr2    97609982    97616788
164     chr2    95780657    95829665    49008   chr2:96053880   882178  -   chr2    96053880    96102738
1316    chr2    95829982    95865446    35464   chr2:97296848   242680  -   chr2    97296848    97333087
1316    chr2    95829982    95935104    105122  chr2:97438085   1169669 +   chr2    97438085    97544431
1317    chr2    96036782    96052481    15699   chr2:95428891   205673  +   chr2    95428891    95443771
like image 672
user645600 Avatar asked Jul 30 '11 01:07

user645600


People also ask

Is there a way to match up two columns in Excel?

Navigate to the "Home" option and select duplicate values in the toolbar. Next, navigate to Conditional Formatting in Excel Option. A new window will appear on the screen with options to select "Duplicate" and "Unique" values. You can compare the two columns with matching values or unique values.


1 Answers

You have not indicated what you would consider a correct answer and your terminology seems a bit vague when you talk about "where there is a reciprocal match", but if I understand the task correctly as finding all rows where col.3 == col.10 & col.4 == col.11, then this should accomplish the task:

which( outer(indat$V4, indat$V11, "==") & 
       outer(indat$V3, indat$V10, "=="), 
       arr.ind=TRUE)
# result
      row col
 [1,]  19   1
 [2,]  10   3
 [3,]   7   6
 [4,]   8   6
 [5,]   6   7
 [6,]  11   8
 [7,]   3  10
 [8,]   7  11
 [9,]   8  11
[10,]   1  19

The outer function applies a function 'FUN', in this case "==", to all two-way combinations of x and y, its first and second arguments, so here we get an n x n matrix with logical entries and I am taking the logical 'and' of two such matrices. So the rows where there are matches with other rows are:

unique( c(which( outer(indat$V4, indat$V11, "==") & 
outer(indat$V3, indat$V10, "=="), 
arr.ind=TRUE) ))

#[1] 19 10  7  8  6 11  3  1

So the set with no matches, assuming a data.frame named indat, is:

matches <- unique( c(which( outer(indat$V4, indat$V11, "==") & 
                      outer(indat$V3, indat$V10, "=="), arr.ind=TRUE) ))
indat[ ! 1:NROW(indat) %in% matches, ]

And the ones with matches are:

indat[ 1:NROW(indat) %in% matches, ]
like image 52
IRTFM Avatar answered Oct 04 '22 21:10

IRTFM