I have two data frames, below is a small sample of each:
df1 <- data.frame(a1= c(3,4), a2 = c(8, 8), a3 = c(4, 18), a4 = c(9,9), a5 = c(17, 30))
df2 <- data.frame(a1 = c(2,2,2,3,3,3,4,4,4), a2 = c(7,7,7,7,7,7,7,7,7),
a3 = c(4,4,4,4,4,4,4,4,4), a4 = c(10,10,10, 10, 10, 10, 10,10,10),
a5 = c(15,16,17, 15, 16, 17, 15, 16, 17))
I would like to examine, for each row of df1
, whether it has "neighbors" in df2
, where, by neighbors I mean observations that are different by at most 1 in each column (in absolute value). So for example, row 2 of df2
is a neighbor of row 1 in df1
.
The way I currently do this is the following:
sweep(as.matrix(df2), 2, as.matrix(df1[1,]), "-")
For row 1 of df1
, and I have to repeat this for each row of df1. Note that df2 and df1 do not have the same number of rows.
However, what I would really like is to avoid doing this "by row", because my data frames have many rows. Is there a way to do it vectorially?
You can use split your row of df1
into a list, and then use lapply
to achieve Vectorization:
my_list=lapply(as.list(data.frame(t(df1))),function(x) sweep(as.matrix(df2), 2, as.matrix(x), "-"))
each element of my_list
is the computation result of each row in df1
my_list[[1]]
a1 a2 a3 a4 a5
[1,] -1 -1 0 1 -2
[2,] -1 -1 0 1 -1
[3,] -1 -1 0 1 0
[4,] 0 -1 0 1 -2
[5,] 0 -1 0 1 -1
[6,] 0 -1 0 1 0
[7,] 1 -1 0 1 -2
[8,] 1 -1 0 1 -1
[9,] 1 -1 0 1 0
Also, you can use parallel::mclapply
which is faster than traditional lapply
Here is a possible data.table
approach using non-equi joins
library(data.table)
cols <- names(df2)
#convert into data.table and add row index for clarity
setDT(df1)[, rn1 := .I]
setDT(df2)[, rn2 := .I]
#create a lower (-1) and upper (+1) bound on each column
bandsNames <- paste0(rep(cols, each=2L), "_", rep(c("lower", "upper"), length(cols)))
df2Bands <- df2[,
{
ans <- do.call(cbind, lapply(.SD, function(x) outer(x, c(-1L, 1L), `+`)))
setnames(data.table(ans), bandsNames)
}, by=.(rn2)]
#create the non-equi join conditions
lowerLimits <- paste0(cols, "_lower<=", cols)
upperLimits <- paste0(cols, "_upper>=", cols)
#perform the non-equi join on lower and upper limits and return the count
#`:=` add a new column in df1 by reference
df1[, Count :=
df2Bands[df1, .N, by=.EACHI, on=c(lowerLimits, upperLimits)]$N
]
desired output:
a1 a2 a3 a4 a5 rn1 Count
1: 3 8 4 9 17 1 6
2: 4 8 18 9 30 2 0
If you want to find the matching rows as well:
df2Bands[df1, .(rn1=i.rn1, rn2=x.rn2), by=.EACHI, on=c(lowerLimits, upperLimits)][,
-(1L:length(bandsNames))]
Matched rows:
rn1 rn2
1: 1 2
2: 1 3
3: 1 5
4: 1 6
5: 1 8
6: 1 9
7: 2 NA
I do not think there is a good way to fully vectorise this problem, (apply family are really just for loops in a bow tie). But you can do it on a by column basis, rather than by row. If further improvement is required the size of the problem can be reduced after each column by removing rows that can be excluded from ever matching (this will cause an indexing headache, but is relatively do-able).
My attempt is below which uses a for loop (which could be replaced by lapply). It returns a truth matrix, rows with a 1 can be matched to columns with a 1, which gives the pairing of neighbours.
col_comp = function(x,y)
{
lx = length(x)
ly = length(y)
return(abs(rep(x,ly) - rep(y,each = lx) )<=1)
}
full_comp=function(df1,df2)
{
rows1 = seq_len(nrow(df1))
rows2 = seq_len(nrow(df2))
M = matrix(1L, nrow=nrow(df1),ncol=nrow(df2))
for(i in seq_len(ncol(df1)) )
{
matches = col_comp(df1[rows1,i],df2[rows2,i])
M = M*matches
}
return(M)
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With