I have two matrices. Matrix A has 14 variables (but I am only interested in five variables for this exercise) and matrix B has five variables. Matrix A has 250,000 observations and matrix B has 23,000 observations. The variables are non-numeric characters (text).
My goal is to fill columns 9 and 10 of matrix A with information from columns 4 and 5 from matrix B, if columns 1, 2, and 8 from matrix A are equivalent to columns 1, 2, and 3 from matrix B.
The code I have written is a double for loop and takes about 10 days to run. I am looking for a way to simplify the code to make it more efficient and run considerably quicker (ideally in a couple of hours).
The code is as follows:
for (i in 1:nrow(matrix_b)) {
for (j in 1:nrow(matrix_a)) {
if (matrix_a[j,1]==matrix_b[i,1]) {
if (matrix_a[j,2]==matrix_b[i,2]) {
if (matrix_a[j,8]==matrix_b[i,3]) {
matrix_a[j,9]<-matrix_b[i,4]
matrix_a[j,10]<-matrix_b[i,5]
}
}
}
}
}
An example would be:
# Matrix A
[,1] [,2] [,8] [,9] [,10]
[1,] Z10 11 Nov 2012 Rocko Large NA NA
# Matrix B
[,1] [,2] [,3] [,4] [,5]
[1,] Z10 11 Nov 2012 Rocko Large Aldi CFO
In this case, the code should insert "Aldi" in cell [1,9] and "CFO" in cell [1,10] of matrix A, because cells [1,1], [1,2], and [1,8] in matrix A are identical to the cells [1,1], [1,2], and [1,3] in matrix B.
If the data looked like this (the date in column 2 is different), the code should NOT do anything; i.e., leave cells [1,9] and [1,10] in matrix A as they are.
# Matrix A
[,1] [,2] [,8] [,9] [,10]
[1,] Z10 15 Dec 2013 Rocko Large NA NA
# Matrix B
[,1] [,2] [,3] [,4] [,5]
[1,] Z10 11 Nov 2012 Rocko Large Aldi CFO
Here's an approach that avoids the loops altogether and instead uses joins to the same effect:
set.seed(1)
# Create some fake data with the given dimensions
matrix_a <- matrix(sample(letters, 14 * 250000, replace = T), ncol = 14)
matrix_b <- matrix(sample(letters, 5 * 23000, replace = T), ncol = 5)
library(dplyr)
# Convert matrices to data frames
df_a <- as_tibble(matrix_a)
df_b <- as_tibble(matrix_b)
# Simplify df_b so that multiple rows from df_b don't match the same row
# in df_a: this is implied by the loop in the question, and isn't necessary
# to do for the real data, if the condition holds.
df_b <- df_b %>%
distinct(V1, V2, V3, .keep_all = T)
new <- df_a %>%
# join columns from b to a that match on the specified variables
left_join(df_b %>% rename_all(~ paste0("b_", .)),
by = c(V1 = "b_V1", V2 = "b_V2", V8 = "b_V3")) %>%
# if there was a match in b, replace value in a with the value from b
mutate(
V9 = if_else(!is.na(b_V4), b_V4, V9),
V10 = if_else(!is.na(b_V5), b_V5, V10)
) %>%
# drop the added columns from b
select(-starts_with("b_"))
And for the time it takes: the join and mutate took 0.17 seconds on my machine.
If you want to learn more about the data wrangling tools available in dplyr a good place to start is the package vignette and website.
Checking that rows that should change, did change:
should_change <- df_a %>%
mutate(row_id = row_number()) %>%
semi_join(df_b, by = c("V1", "V2", V8 = "V3"))
new %>%
mutate(row_id = row_number()) %>%
select(row_id, V9, V10) %>%
inner_join(should_change %>% select(row_id, V9, V10),
by = "row_id", suffix = c("_new", "_old"))
#> # A tibble: 181,708 x 5
#> row_id V9_new V10_new V9_old V10_old
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 j k p m
#> 2 3 n e f h
#> 3 4 s n v s
#> 4 6 m r j n
#> 5 7 c v t k
#> 6 8 m n u y
#> 7 9 m x l r
#> 8 10 e a z v
#> 9 11 e q l k
#> 10 12 o f z q
#> # ... with 181,698 more rows
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With