
Matching strings in a column of a data frame with the strings in a column of another data frame using R or Python

I am trying to match the strings in a column of one data frame against the strings in a column of another data frame and map the corresponding values. The two data frames have different numbers of rows.

df1 = data.frame(name = c("(CKMB)Creatinine Kinase Muscle & Brain", "24 Hours Urine for Sodium", "Antistreptolysin O Titer", "Blood group O"), lonic_code = c("27816-8-O", "27816-8-B", "1869-7", "33914-3"))
df2 = data.frame(Testcomponents = c("creatinine", "blood", "potassium"))

Expected output

Test Components   lonic_code
creatinine        27816-8-O
blood             1869-7
potassium         NA
asked Jan 31 '18 06:01 by ajax


3 Answers

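regex_right_join from the fuzzyjoin package could be handy in this case. It keeps every row of df2 and attaches the rows of df1 whose name matches the Testcomponents value as a regular expression; components with no match get NA.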

library(fuzzyjoin)
library(dplyr)

df1 %>%
  mutate(name = as.character(name)) %>%
  regex_right_join(df2 %>%
                     mutate(Testcomponents = as.character(Testcomponents)), 
                   by = c(name = "Testcomponents"), ignore_case = TRUE) %>%
  select(Testcomponents, lonic_code)

Output is:

  Testcomponents lonic_code
1     creatinine  27816-8-O
2          blood    33914-3
3      potassium       <NA>

Sample data:

df1 <- structure(list(name = structure(1:4, .Label = c("(CKMB)Creatinine Kinase Muscle & Brain", 
"24 Hours Urine for Sodium", "Antistreptolysin O Titer", "Blood group O"
), class = "factor"), lonic_code = structure(c(3L, 2L, 1L, 4L
), .Label = c("1869-7", "27816-8-B", "27816-8-O", "33914-3"), class = "factor")), .Names = c("name", 
"lonic_code"), row.names = c(NA, -4L), class = "data.frame")

df2 <- structure(list(Testcomponents = structure(c(2L, 1L, 3L), .Label = c("blood", 
"creatinine", "potassium"), class = "factor")), .Names = "Testcomponents", row.names = c(NA, 
-3L), class = "data.frame")
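
If installing fuzzyjoin is not an option, roughly the same right-join behaviour can be sketched in base R. This is only an illustration and not how fuzzyjoin is implemented: match_codes is a hypothetical helper, and the data is retyped from the question with stringsAsFactors = FALSE so that the codes stay character strings.

```r
# Data retyped from the question; stringsAsFactors = FALSE keeps plain strings
df1 <- data.frame(
  name = c("(CKMB)Creatinine Kinase Muscle & Brain", "24 Hours Urine for Sodium",
           "Antistreptolysin O Titer", "Blood group O"),
  lonic_code = c("27816-8-O", "27816-8-B", "1869-7", "33914-3"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(Testcomponents = c("creatinine", "blood", "potassium"),
                  stringsAsFactors = FALSE)

# Hypothetical helper (not part of fuzzyjoin): one output row per matching
# df1 row, or a single NA row when the component matches nothing.
match_codes <- function(component, test_names, codes) {
  hits <- grepl(component, test_names, ignore.case = TRUE)
  matched <- if (any(hits)) codes[hits] else NA_character_
  data.frame(Testcomponents = component, lonic_code = matched,
             stringsAsFactors = FALSE)
}

# Build one small data frame per component and stack them
joined <- do.call(rbind, lapply(df2$Testcomponents, function(comp) {
  match_codes(comp, df1$name, df1$lonic_code)
}))
print(joined)
```

With the question's data this yields one row per component, with NA for potassium. A component that matched several names would produce several rows, which is the same long format a join returns.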
answered Oct 05 '22 14:10 by 1.618


Here is a possible solution. It is probably not the most elegant one, so I am curious to see other approaches.

df1 = data.frame(name = c("(CKMB)Creatinine Kinase Muscle & Brain", "24 Hours Urine for Sodium", "Antistreptolysin O Titer", "Blood group O"), lonic_code = c("27816-8-O", "27816-8-B", "1869-7", "33914-3"))
df2 = data.frame(Testcomponents = c("creatinine", "blood", "potassium"))

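# For each test component, find the rows of df1 whose name contains it
# (case-insensitive) and return their codes as character strings.
# Components without a match yield character(0).
result = lapply(as.character(df2$Testcomponents), function(x) {
  as.character(df1$lonic_code[grepl(x, df1$name, ignore.case = TRUE)])
})

# Stored as a list column, since a component may match several rows
df2$Ionic_code = result

Output:

  Testcomponents Ionic_code
1     creatinine  27816-8-O
2          blood    33914-3
3      potassium           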
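
The result here is a list with zero, one, or several codes per component. If a flat character column with NA for unmatched components is preferred, as in the question's expected output, one possible sketch is shown below. flatten_codes is a hypothetical helper, the "; " separator is an arbitrary choice, and the list is typed out by hand to mirror the shape of the result rather than taken from a run.

```r
# Hand-typed stand-in for the list of matches: zero, one, or several codes each
result <- list(creatinine = "27816-8-O",
               blood      = "33914-3",
               potassium  = character(0))

# Hypothetical helper: NA when nothing matched, otherwise all matches
# joined into one string. as.character() also covers factor input.
flatten_codes <- function(codes) {
  if (length(codes) == 0) NA_character_ else paste(as.character(codes), collapse = "; ")
}

# vapply guarantees exactly one string per component
Ionic_code <- vapply(result, flatten_codes, character(1), USE.NAMES = FALSE)
print(Ionic_code)
```

A component that matched two rows would come out as a single string such as "a; b", so the column always has one entry per row of df2.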
answered Oct 05 '22 13:10 by Florian


This is a little more code than Florian's answer, but I think it makes up for it by being easier to read:

df1 = data.frame(Testcomponent = c("Albumin", "HDL Cholesterol",
                                   "Erythrocyte Sedimentation Rate (ESR)", "Thyroid-stimulating Hormone (TSH)"))

df2 = data.frame(Names = c("Micro Albumin", "Serum Globulin", "CMV Antibody (IgG)"), lonic_code = c("10501-5", "5196", "EKC 1"))

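# Return the code(s) of the df2 rows whose name contains component.name
# (case-insensitive), or NA when nothing matches.
get.test.component <- function(component.name) {
  component <- grep(as.character(component.name), df2$Names, ignore.case = TRUE)
  if (length(component) == 0) {
    return (NA)
  } else {
    return (as.character(df2$lonic_code[component]))
  }
}

# cbind needs exactly one result per component, so this assumes each
# component matches at most one row of df2.
new.ionic.codes <- Reduce(c, lapply(df1$Testcomponent, function(x) get.test.component(x)))
df1.new <- cbind(df1, new.ionic.codes)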
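
One thing to watch with grep here: the component name is treated as a regular expression, so a name containing parentheses, such as "Erythrocyte Sedimentation Rate (ESR)", will not match its own literal text. The sketch below shows the difference between regex and literal matching. The first entry of test_names is a hypothetical row added for illustration and is not in the sample data above.

```r
component <- "Erythrocyte Sedimentation Rate (ESR)"

# First name is hypothetical (added for illustration); second is from df2
test_names <- c("Erythrocyte Sedimentation Rate (ESR)", "Micro Albumin")

# As a regex, "(ESR)" is a capture group, so the pattern looks for
# "... Rate ESR" without the literal parentheses
regex_hit <- grepl(component, test_names)

# fixed = TRUE matches the component as a literal substring
literal_hit <- grepl(component, test_names, fixed = TRUE)

# fixed = TRUE cannot be combined with ignore.case, so lower-case both
# sides to get a case-insensitive literal match
ci_hit <- grepl(tolower("erythrocyte sedimentation rate (esr)"),
                tolower(test_names), fixed = TRUE)

print(rbind(regex_hit, literal_hit, ci_hit))
```

Whether literal or regex matching is the better fit depends on the data: literal matching is safer when component names contain characters such as ( ) . + or ?, while regex matching allows patterns such as alternation.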
answered Oct 05 '22 14:10 by Cactus