I have a database of employees, with their manager's id, in long format (one row per employee per month). I would like to add a column that contains their manager's manager's id (or the id of their skip level manager).
Here is a toy dataset:
id <- c(seq.int(1,11), seq.int(2,12))
mgr_id <- as.integer(c(NA, 1, 1, 2, 2, 2, 2, 3, 3, 5, 5, #period 1
NA, 2, 5, 2, 5, 5, 3, 3, 5, 10, 10)) #period 2
period <- c(rep(1, 11), rep(2, 11))
left_company <- c(1, rep(0, 21))
joined_company <- c(rep(0, 21), 1)
df <- data.frame(id, mgr_id, period, left_company, joined_company)
And here is a function I wrote that returns the expected results.
# finds the employee's manager in the correct period, and returns that manager's id
get_mgr_mgr_id <- function(manager_id, period){
mgr_mgr_id <- df$mgr_id[df$id == manager_id & df$period == period]
return(mgr_mgr_id[1])
}
When I use the function with mapply
, all is well. Note that employee 1 left the company, and they were replaced by employee 5, who was replaced by employee 10, who was replaced by employee 12, a new hire.
df$mgr_mgr_id <- mapply(get_mgr_mgr_id, df$mgr_id, df$period)
df
id mgr_id period left joined mgr_mgr_id
1 1 NA 1 1 0 NA
2 2 1 1 0 0 NA
3 3 1 1 0 0 NA
4 4 2 1 0 0 1
5 5 2 1 0 0 1
6 6 2 1 0 0 1
7 7 2 1 0 0 1
8 8 3 1 0 0 1
9 9 3 1 0 0 1
10 10 5 1 0 0 2
11 11 5 1 0 0 2
12 2 NA 2 0 0 NA
13 3 2 2 0 0 NA
14 4 5 2 0 0 2
15 5 2 2 0 0 NA
16 6 5 2 0 0 2
17 7 5 2 0 0 2
18 8 3 2 0 0 2
19 9 3 2 0 0 2
20 10 5 2 0 0 2
21 11 10 2 0 0 5
22 12 10 2 0 1 5
My question: is there is a more efficient way to get this result? Currently it takes quite a long time to run even on 10,000 rows, and my dataset has closer to a million.
I'm also open to suggestions on a more general question title (possibly a version of this SQL question: Most efficient way to find something recursively in a table?)
SELECT supplier_id FROM suppliers INTERSECT SELECT supplier_id FROM orders; In this SQL INTERSECT example, if a supplier_id appeared in both the suppliers and orders table, it would appear in your result set. Now, let's complicate our example further by adding WHERE conditions to the INTERSECT query.
How do you retrive department ID column without any duplication from employee relation? SELECT MANAGER_ID, DEPARTMENT_ID FROM EMPLOYEES; We can see duplicate result values in above query. If we want to eliminate the duplicate entries, we can use DISTINCT clause.
You can run a join with data.table. I'm not sure how much faster it will be:
library(data.table)
setDT(df)
df[, m2id := df[.(id = mgr_id, period = period), on=c("id", "period"), mgr_id]]
id mgr_id period left_company joined_company m2id
1: 1 NA 1 1 0 NA
2: 2 1 1 0 0 NA
3: 3 1 1 0 0 NA
4: 4 2 1 0 0 1
5: 5 2 1 0 0 1
6: 6 2 1 0 0 1
7: 7 2 1 0 0 1
8: 8 3 1 0 0 1
9: 9 3 1 0 0 1
10: 10 5 1 0 0 2
11: 11 5 1 0 0 2
12: 2 NA 2 0 0 NA
13: 3 2 2 0 0 NA
14: 4 5 2 0 0 2
15: 5 2 2 0 0 NA
16: 6 5 2 0 0 2
17: 7 5 2 0 0 2
18: 8 3 2 0 0 2
19: 9 3 2 0 0 2
20: 10 5 2 0 0 2
21: 11 10 2 0 0 5
22: 12 10 2 0 1 5
id mgr_id period left_company joined_company m2id
How it works
The syntax for a join is x[i, on=, j]
. It uses i
and on
to subset x
and then returns j
. The key point here is setting id = mgr_id
in i
so we're subsetting to the manager's rows.
The syntax for assigning a column is DT[, col_name := value]
. In this case, the value comes from the join explained in the last paragraph.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With