Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient way to find manager's manager's id

I have a database of employees, with their manager's id, in long format (one row per employee per month). I would like to add a column that contains their manager's manager's id (or the id of their skip level manager).

Here is a toy dataset:

id <- c(seq.int(1,11), seq.int(2,12))
mgr_id <- as.integer(c(NA, 1, 1, 2, 2, 2, 2, 3, 3, 5, 5,   #period 1
                      NA, 2, 5, 2, 5, 5, 3, 3, 5, 10, 10)) #period 2
period <- c(rep(1, 11), rep(2, 11))
left_company <- c(1, rep(0, 21))
joined_company <- c(rep(0, 21), 1) 

df <- data.frame(id, mgr_id, period, left_company, joined_company)

And here is a function I wrote that returns the expected results.

# finds the employee's manager in the correct period, and returns that manager's id

    get_mgr_mgr_id <- function(manager_id, period){
      mgr_mgr_id <- df$mgr_id[df$id == manager_id & df$period == period] 
      return(mgr_mgr_id[1])
    }

When I use the function with mapply, all is well. Note that employee 1 left the company, and they were replaced by employee 5, who was replaced by employee 10, who was replaced by employee 12, a new hire.

df$mgr_mgr_id <- mapply(get_mgr_mgr_id, df$mgr_id, df$period)

df
   id mgr_id period left joined mgr_mgr_id
1   1     NA      1    1      0         NA
2   2      1      1    0      0         NA
3   3      1      1    0      0         NA
4   4      2      1    0      0          1
5   5      2      1    0      0          1
6   6      2      1    0      0          1
7   7      2      1    0      0          1
8   8      3      1    0      0          1
9   9      3      1    0      0          1
10 10      5      1    0      0          2
11 11      5      1    0      0          2
12  2     NA      2    0      0         NA
13  3      2      2    0      0         NA
14  4      5      2    0      0          2
15  5      2      2    0      0         NA
16  6      5      2    0      0          2
17  7      5      2    0      0          2
18  8      3      2    0      0          2
19  9      3      2    0      0          2
20 10      5      2    0      0          2
21 11     10      2    0      0          5
22 12     10      2    0      1          5

My question: is there is a more efficient way to get this result? Currently it takes quite a long time to run even on 10,000 rows, and my dataset has closer to a million.

I'm also open to suggestions on a more general question title (possibly a version of this SQL question: Most efficient way to find something recursively in a table?)

like image 611
BLT Avatar asked Aug 29 '16 16:08

BLT


People also ask

Which query can fetch supplier IDS?

SELECT supplier_id FROM suppliers INTERSECT SELECT supplier_id FROM orders; In this SQL INTERSECT example, if a supplier_id appeared in both the suppliers and orders table, it would appear in your result set. Now, let's complicate our example further by adding WHERE conditions to the INTERSECT query.

How can we retrieve department ID column without any duplication from employee relation?

How do you retrive department ID column without any duplication from employee relation? SELECT MANAGER_ID, DEPARTMENT_ID FROM EMPLOYEES; We can see duplicate result values in above query. If we want to eliminate the duplicate entries, we can use DISTINCT clause.


1 Answers

You can run a join with data.table. I'm not sure how much faster it will be:

library(data.table)
setDT(df)

df[, m2id := df[.(id = mgr_id, period = period), on=c("id", "period"), mgr_id]]

    id mgr_id period left_company joined_company m2id
 1:  1     NA      1            1              0   NA
 2:  2      1      1            0              0   NA
 3:  3      1      1            0              0   NA
 4:  4      2      1            0              0    1
 5:  5      2      1            0              0    1
 6:  6      2      1            0              0    1
 7:  7      2      1            0              0    1
 8:  8      3      1            0              0    1
 9:  9      3      1            0              0    1
10: 10      5      1            0              0    2
11: 11      5      1            0              0    2
12:  2     NA      2            0              0   NA
13:  3      2      2            0              0   NA
14:  4      5      2            0              0    2
15:  5      2      2            0              0   NA
16:  6      5      2            0              0    2
17:  7      5      2            0              0    2
18:  8      3      2            0              0    2
19:  9      3      2            0              0    2
20: 10      5      2            0              0    2
21: 11     10      2            0              0    5
22: 12     10      2            0              1    5
    id mgr_id period left_company joined_company m2id

How it works

The syntax for a join is x[i, on=, j]. It uses i and on to subset x and then returns j. The key point here is setting id = mgr_id in i so we're subsetting to the manager's rows.

The syntax for assigning a column is DT[, col_name := value]. In this case, the value comes from the join explained in the last paragraph.

like image 113
Frank Avatar answered Oct 12 '22 02:10

Frank