Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Replace wrong values in df2 with true values in df1 by using 2 common columns in R

I have 2 data frames like this

TEAM <- c("PE","PE","MPI","TDT","HPT")
EmpID <- c (444452,444456,16822,339862,14828)    
ManagerID <- c(11499,11599,11899,11339,11559)
CODE <- c("F",NA,"A","H","G")
df1 <- data.frame(TEAM,EmpID,ManagerID,CODE)

TEAM <- c("MPI","TDT","HPT","PE","TDT","PE","MPI","TDT","HPT","PE")
EmpID <- c(444452,444452,444452,339862,339862,16822,339862,16822,14828,14828)
ManagerID <- c(11499,11499,11499,11339,11339,11899,11339,11899,11559,11559)
CODE <- c("A234","H665","G654","F616","H626","F234","H695","G954","G616",NA)
df2 <- data.frame(TEAM,EmpID,ManagerID,CODE)

I am trying to update the wrong values of ManagerID & EmpID in df2 with the true values of ManagerID & EmpID in df1 only when the TEAM & the CODE (matching the letter in CODE column in df1 with the first letter of CODE column in df2). If the team matches but the code is not correct, then the wrong values stay and shouldn't be replaced with the values from df1.

My desired output is

   TEAM  EmpID ManagerID CODE
1   MPI  16822     11899 A234
2   TDT 339862     11339 H665
3   HPT  14828     11559 G654
4    PE 444452     11499 F616
5   TDT 339862     11339 H626
6    PE 444452     11499 F234
7   MPI 339862     11339 H695
8   TDT  16822     11899 G954
9   HPT  14828     11559 G616
10   PE 444452     11599 <NA>

You can see that the row 7 & 8 remain unchanged since the codes don't match.

I tried doing it this way with the help from Gregor for my previous question

df2$ManagerID = df1$ManagerID[match(substr(df2$CODE, 1, 1), df1$CODE)]
df2$EmpID = df1$EmpID [match(substr(df2$CODE, 1, 1), df1$CODE)]

I am not sure if I am headed in the right direction. Kindly help me with inputs on how efficiently to solve this.

like image 264
Sharath Avatar asked Dec 07 '25 10:12

Sharath


1 Answers

Since you are doing text subsetting here, I would initialize as character vectors, not factors:

df1 <- data.frame(TEAM,EmpID,ManagerID,CODE, stringsAsFactors = FALSE)
df2 <- data.frame(TEAM,EmpID,ManagerID,CODE, stringsAsFactors = FALSE)

I would then use data table (not necessary, but cleaner):

library(data.table)

# convert data frames to data.table
setDT(df1)
setDT(df2)

You want to create a column with the value to merge here, just taking the first character of df2's CODE

df2[ , C_SHORT := substr(CODE,1,1)]

We then merge the two data frames on the TEAM/CODE combination. This creates NAs where any matches do not exist. Then, test if any of the columns are NA, and insert the initial value if they are

merge(x = df2,y = df1, by.x = c("TEAM","C_SHORT"), by.y = c("TEAM","CODE"), all.x = TRUE)[
  ,
  .(
    TEAM,
    EmpID = ifelse(is.na(EmpID.y), EmpID.x, EmpID.y), 
    ManagerID = ifelse(is.na(ManagerID.y), ManagerID.x, ManagerID.y),
    CODE
  )
  ]
    TEAM  EmpID ManagerID CODE
 1:  HPT  14828     11559 G654
 2:  HPT  14828     11559 G616
 3:  MPI  16822     11899 A234
 4:  MPI 339862     11339 H695
 5:   PE 444456     11599   NA
 6:   PE 444452     11499 F616
 7:   PE 444452     11499 F234
 8:  TDT  16822     11899 G954
 9:  TDT 339862     11339 H665
10:  TDT 339862     11339 H626

One note: You are using NA here as a lookup here. This works in the merge function (which I did not realize), but IMO this is bad practice (NA in R really refers to missing data, whereas here it encodes something). I would think about changing how this is represented in your data.

like image 89
Chris Avatar answered Dec 10 '25 01:12

Chris



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!