I've been struggling for a while now and I can't find a way out. Here's my problem.
I have 2 dataframes:
df1 <- data.frame(replicate(3,sample(1:10,20,rep=TRUE)))
df1
X1 X2 X3
1 10 1 9
2 3 4 2
3 7 6 8
4 8 10 7
5 5 7 5
6 8 5 9
7 9 8 4
8 6 2 7
9 2 9 6
10 5 2 9
df2 <- data.frame(df1[sample(nrow(df1),4), ])
df2
X1 X2 X3
8 6 2 7
3 7 6 8
10 5 2 9
7 9 8 4
I would like to create a vector x of length(x) = length(df1) containing, per each row of df1, the row index of the corresponding row in df2 (i.e. same exact values for each column between df1 and df2).
Consider that:
dim(df1)
[1] 1096188 3
dim(df2)
[1] 256 3
and that df1 has several rows with the same values (i.e. the corresponding row index will be the same), and that in principle all the rows in df1 should find a match with the row in df2.
The expected output would be:
x
[1] 0 0 2 0 0 0 4 1 0 3
Hope this was clear enough...
Can you help?
Thanks,
Piera
Here is an option with data.table
:
require(data.table)
# first set the original orders (data.frame will be sorted when doing setkey)
setDT(df1)[, ori := .I]
setDT(df2)[, ind_df2 := .I]
# define keys
setkey(df1, X1, X2, X3)
setkey(df2, X1, X2, X3)
# compute the indices of the df1 line in df2
x <- df2[df1, ind_df2]
# put the nomatch to 0
x[is.na(x)] <- 0
# Finally, put the original orders back and delete the variable ori
x <- x[order(df1$ori)]
df2 <- df2[order(df2$ind_df2)]
df1[, ori:=NULL]
df2[, ind_df2:=NULL]
resulting x (with your data):
x
#[1] 0 0 2 0 0 0 4 1 0 3
Another, more simple and efficient option, suggested by @Frank:
setkeyv(setDT(df2)[,ii:=.I],setdiff(names(df2),"ii"))
x <- df2[df1]$ii
x[is.na(x)] <- 0
Some benchmark between @nicola answer, @Frank suggestion and my answer, on a 100000 rows df1 and 200 rows df2, with a slight modification of nicola's answer to get the desired output (both functions give the same result, except the need for as.numeric
for nicola's):
so:
set.seed(17)
df1 <- data.frame(replicate(3,sample(1:100,100000,rep=TRUE)))
df2 <- data.frame(df1[sample(nrow(df1),200), ])
nicola <- function(){x<-match(do.call(paste,df1),do.call(paste,df2), nomatch=0)}
cath <- function(){
dt1 <-data.table(df1); dt1[, ori:=.I]
dt2 <- data.table(df2); dt2[, ind_df2:=.I]
setkey(dt1, X1, X2, X3)
setkey(dt2, X1, X2, X3)
x <- dt2[dt1, ind_df2]
x[is.na(x)] <- 0
x <- x[order(dt1$ori)]
x
}
Frank <- function(){dt1 <-data.table(df1);dt2 <- data.table(df2); setkey(setDT(dt2)[,ii:=.I],X1,X2,X3); x <- dt2[dt1]$ii;x[is.na(x)] <- 0}
require(microbenchmark)
microbenchmark(cath(), Frank(), nicola(), unit="relative", times=100)
#Unit: relative
# expr min lq mean median uq max neval cld
#Frank() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100 a
# cath() 3.238195 3.099896 2.438342 2.767165 2.177365 1.447397 100 b
#nicola() 13.127820 12.476996 8.761549 10.899191 7.292086 2.783436 100 c
I'd just try:
x <- rownames(df2)[match(do.call(paste, df1), do.call(paste, df2))]
x[is.na(x)] <- 0
There is quite a discussion on what it's the desired output; in @CathG interpretation, this line produces it:
match(do.call(paste, df1), do.call(paste, df2),nomatch=0)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With