I have two data frames in R, and I need to count the element matches row by row, getting finally a column with the length of the cartesian product of both tables and the IDs of both rows. Also, the tables are quite big and with different number of rows, but same number of columns.
I have the following code, but it is quite slow when having multiple runs.
library(data.table)
table_1<-data.table(matrix(c(1:24),nrow = 4))
table_2<-data.table(matrix(c(11:34),nrow = 4))
names(table_1)<-c("s1", "s2","s3","s4","s5","s6")
names(table_2)<-c("a1","a2","a3","a4","a5","a6")
table_1$ID<-seq.int(nrow(table_1))
table_2$ID_ap<-seq.int(nrow(table_2))
setcolorder(table_1, c("ID", "s1", "s2","s3","s4","s5","s6"))
setcolorder(table_2, c("ID_ap","a1","a2","a3","a4","a5","a6"))
CJ.table<-function(X,Y) setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
join<-CJ.table(table_1,table_2)
R<-subset(join, select=c("ID_ap","ID"))
R$Ac<- (join$s1 == join$a1) + (join$s1 ==join$a2) + (join$s1 ==join$a3) + (join$s1 ==join$a4) + (join$s1 ==join$a5) + (join$s1 ==join$a6)+
(join$s2 == join$a1) + (join$s2 ==join$a2) + (join$s2 ==join$a3) + (join$s2 ==join$a4) + (join$s2 ==join$a5) + (join$s2 ==join$a6)+
(join$s3 == join$a1) + (join$s3 ==join$a2) + (join$s3 ==join$a3) + (join$s3 ==join$a4) + (join$s3 ==join$a5) + (join$s3 ==join$a6)+
(join$s4 == join$a1) + (join$s4 ==join$a2) + (join$s4 ==join$a3) + (join$s4 ==join$a4) + (join$s4 ==join$a5) + (join$s4 ==join$a6)+
(join$s5 == join$a1) + (join$s5 ==join$a2) + (join$s5 ==join$a3) + (join$s5 ==join$a4) + (join$s5 ==join$a5) + (join$s5 ==join$a6)+
(join$s6 == join$a1) + (join$s6 ==join$a2) + (join$s6 ==join$a3) + (join$s6 ==join$a4) + (join$s6 ==join$a5) + (join$s6 ==join$a6)
which gives
R
ID_ap ID Ac
1: 1 1 0
2: 1 2 0
3: 1 3 4
4: 1 4 0
5: 2 1 0
6: 2 2 0
7: 2 3 0
8: 2 4 4
9: 3 1 3
10: 3 2 0
11: 3 3 0
12: 3 4 0
13: 4 1 0
14: 4 2 3
15: 4 3 0
16: 4 4 0
R provides us nrow() function to get the rows for an object. That is, with nrow() function, we can easily detect and extract the number of rows present in an object that can be matrix, data frame or even a dataset.
Group By Count in R using dplyr You can use group_by() function along with the summarise() from dplyr package to find the group by count in R DataFrame, group_by() returns the grouped_df ( A grouped Data Frame) and use summarise() on grouped df to get the group by count.
To find the row corresponding to a nearest value in an R data frame, we can use which. min function after getting the absolute difference between the value and the column along with single square brackets for subsetting the row.
If you want to count by multiple conditions, add them all to the filter function. Count function from dplyr package is one simple function and sometimes all that is necessary at the beginning of the analysis.
Put the data in long format, since the column order does not matter:
setnames(table_2, "ID_ap", "ID")
tabs = rbind(
melt(table_1, id="ID")[, variable := NULL],
melt(table_2, id="ID")[, variable := NULL],
idcol = TRUE)
(1) For each value, identify relevant pairs; and
(2) for pairs, count values:
tabs[,
if (uniqueN(.id) > 1L) CJ(ID1 = ID[.id == 1L], ID2 = ID[.id == 2L])
, by=value][,
.N
, by=.(ID1, ID2)]
ID1 ID2 N
1: 3 1 4
2: 4 2 4
3: 1 3 3
4: 2 4 3
All other (ID1, ID2)
combos are zero and need not be explicitly enumerated, I think.
If values are distinct within each table, as in the OP's example, then we can simplify:
tabs[, if (.N==2L) .(ID1 = ID[1L], ID2 = ID[2L]), by=value][, .N, by=.(ID1, ID2)]
Assuming that the product of the number of rows and the number of unique values in both tables is not large:
x1 = unlist(table_1, FALSE, FALSE)
x2 = unlist(table_2, FALSE, FALSE)
with common unique values:
lvs = union(x1, x2)
And tabulate
the occurence of each unique value in each row for each table:
tab1 = matrix(tabulate(seq_len(nrow(table_1)) + (match(x1, lvs) - 1L) * nrow(table_1),
nrow(table_1) * length(lvs)),
nrow(table_1), length(lvs))
tab2 = matrix(tabulate(seq_len(nrow(table_2)) + (match(x2, lvs) - 1L) * nrow(table_2),
nrow(table_2) * length(lvs)),
nrow(table_2), length(lvs))
finally:
tcrossprod(tab1, tab2) #or 'tcrossprod(tab1 > 0L, tab2 > 0L)' to not count duplicate matches
# [,1] [,2] [,3] [,4]
#[1,] 0 0 3 0
#[2,] 0 0 0 3
#[3,] 4 0 0 0
#[4,] 0 4 0 0
#and to change format (among different ways):
ans = tcrossprod(tab1, tab2)
cbind(c(row(ans)), c(col(ans)), c(ans))
If tab1
and tab2
are very large, they can be built as sparse matrices and a way could be:
library(Matrix)
stab1 = xtabs(rep_len(1L, length(x1)) ~
rep_len(seq_len(nrow(table_1)), length(x1))
+ factor(match(x1, lvs), lvs),
sparse = TRUE)
stab2 = xtabs(rep_len(1L, length(x2)) ~
rep_len(seq_len(nrow(table_2)), length(x2))
+ factor(match(x2, lvs), lvs),
sparse = TRUE)
tcrossprod(stab1, stab2)
#4 x 4 sparse Matrix of class "dgCMatrix"
# 1 2 3 4
#1 . . 3 .
#2 . . . 3
#3 4 . . .
#4 . 4 . .
EDIT
Having (1) small positive integer values and (2) distinct values in each row, creating lookups with match
/unique
/union
and tabulating can be avoided:
x1 = unlist(table_1, FALSE, FALSE)
x2 = unlist(table_2, FALSE, FALSE)
nlvs = max(max(x1), max(x2))
stab1 = sparseMatrix(i = rep_len(seq_len(nrow(table_1)), length(x1)),
j = x1,
x = 1L,
dims = c(nrow(table_1), nlvs))
stab2 = sparseMatrix(i = rep_len(seq_len(nrow(table_2)), length(x2)),
j = x2,
x = 1L,
dims = c(nrow(table_2), nlvs))
tcrossprod(stab1, stab2)
#4 x 4 sparse Matrix of class "dgCMatrix"
#
#[1,] . . 3 .
#[2,] . . . 3
#[3,] 4 . . .
#[4,] . 4 . .
summary(tcrossprod(stab1, stab2))
#4 x 4 sparse Matrix of class "dgCMatrix", with 4 entries
# i j x
#1 3 1 4
#2 4 2 4
#3 1 3 3
#4 2 4 3
How about:
colSums(apply(join[, !c("ID", "ID_ap"), with = F], 1, duplicated))
#[1] 0 0 4 0 0 0 0 4 3 0 0 0 0 3 0 0
Or, starting from scratch:
setkey(table_1, ID)
setkey(table_2, ID_ap)
ids = CJ(ID1 = table_1$ID, ID2 = table_2$ID_ap)
ids[, sum(duplicated(c(table_1[.(ID1), !'ID', with = F],
table_2[.(ID2), !'ID_ap', with = F])))
, by = .(ID1, ID2)]
# ID1 ID2 V1
# 1: 1 1 0
# 2: 1 2 0
# 3: 1 3 3
# 4: 1 4 0
# 5: 2 1 0
# 6: 2 2 0
# 7: 2 3 0
# 8: 2 4 3
# 9: 3 1 4
#10: 3 2 0
#11: 3 3 0
#12: 3 4 0
#13: 4 1 0
#14: 4 2 4
#15: 4 3 0
#16: 4 4 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With