Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create new column in dataframe based on partial string matching other column

I have a dataframe with 2 columns GL and GLDESC and want to add a 3rd column called KIND based on some data that is inside of column GLDESC.

The dataframe is as follows:

      GL                             GLDESC 1 515100         Payroll-Indir Salary Labor 2 515900 Payroll-Indir Compensated Absences 3 532300                           Bulk Gas 4 539991                     Area Charge In 5 551000        Repairs & Maint-Spare Parts 6 551100                 Supplies-Operating 7 551300                        Consumables 

For each row of the data table:

  • If GLDESC contains the word Payroll anywhere in the string then I want KIND to be Payroll
  • If GLDESC contains the word Gas anywhere in the string then I want KIND to be Materials
  • In all other cases I want KIND to be Other

I looked for similar examples on stackoverflow but could not find any, also looked in R for dummies on switch, grep, apply and regular expressions to try and match only part of the GLDESC column and then fill the KIND column with the kind of account but was unable to make it work.

like image 650
user2948714 Avatar asked Nov 02 '13 22:11

user2948714


1 Answers

Since you have only two conditions, you can use a nested ifelse:

#random data; it wasn't easy to copy-paste yours   DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10),    c("gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",       "asdfg", "GAS--2", "fghfgh", "qweee"), sample(letters, 10), sep = " "))  DF$KIND <- ifelse(grepl("gas", DF$GLDESC, ignore.case = T), "Materials",           ifelse(grepl("payroll", DF$GLDESC, ignore.case = T), "Payroll", "Other"))  DF #   GL         GLDESC      KIND #1   8        e gas l Materials #2   1  c payroll12 y   Payroll #3  10      m GaSer v Materials #4   6       t asdf n     Other #5   2      w qweaa t     Other #6   4 r PayROll-12 q   Payroll #7   9      n asdfg a     Other #8   5     d GAS--2 w Materials #9   7     s fghfgh e     Other #10  3      g qweee k     Other 

EDIT 10/3/2016 (..after receiving more attention than expected)

A possible solution to deal with more patterns could be to iterate over all patterns and, whenever there is match, progressively reduce the amount of comparisons:

ff = function(x, patterns, replacements = patterns, fill = NA, ...) {     stopifnot(length(patterns) == length(replacements))      ans = rep_len(as.character(fill), length(x))         empty = seq_along(x)      for(i in seq_along(patterns)) {         greps = grepl(patterns[[i]], x[empty], ...)         ans[empty[greps]] = replacements[[i]]           empty = empty[!greps]     }      return(ans) }  ff(DF$GLDESC, c("gas", "payroll"), c("Materials", "Payroll"), "Other", ignore.case = TRUE) # [1] "Materials" "Payroll"   "Materials" "Other"     "Other"     "Payroll"   "Other"     "Materials" "Other"     "Other"  ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),     c("pat1a|pat1b", "pat2", "pat3"),     c("1", "2", "3"), fill = "empty") #[1] "1"     "1"     "3"     "empty"  ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),     c("pat2", "pat1a|pat1b", "pat3"),     c("2", "1", "3"), fill = "empty") #[1] "2"     "1"     "3"     "empty" 
like image 71
alexis_laz Avatar answered Sep 20 '22 05:09

alexis_laz