I frequently need to recode some (not all!) values in a data frame column based off of a look-up table. I'm not satisfied by the ways I know of to solve the problem. I'd like to be able to do it in a clear, stable, and efficient way. Before I write my own function, I'd want to make sure I'm not duplicating something standard that's already out there.
    ## Toy example
    data = data.frame(
      id = 1:7,
      x = c("A", "A", "B", "C", "D", "AA", ".")
    )
    lookup = data.frame(
      old = c("A", "D", "."),
      new = c("a", "d", "!")
    )

    ## desired result
    #   id  x
    # 1  1  a
    # 2  2  a
    # 3  3  B
    # 4  4  C
    # 5  5  d
    # 6  6 AA
    # 7  7  !
I can do it with a join, coalesce, unselect as below, but this isn't as clear as I'd like - too many steps.
    ## This works, but is more steps than I want
    library(dplyr)
    data %>%
      left_join(lookup, by = c("x" = "old")) %>%
      mutate(x = coalesce(new, x)) %>%
      select(-new)
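For comparison, the same join-then-coalesce logic can be sketched in base R with match(). The helper name recode_lookup is hypothetical (it is not from any package), and the sketch assumes the old values in the lookup are unique, since match() only returns the first hit.

```r
# Hypothetical helper (not from any package): recode the values of `x`
# that appear in `old`, leaving every other value untouched.
recode_lookup <- function(x, old, new) {
  idx <- match(x, old)        # position of each x in `old`, NA if absent
  hit <- !is.na(idx)          # TRUE where a lookup entry exists
  x[hit] <- new[idx[hit]]     # overwrite only the matched positions
  x
}

data <- data.frame(
  id = 1:7,
  x = c("A", "A", "B", "C", "D", "AA", "."),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  old = c("A", "D", "."),
  new = c("a", "d", "!"),
  stringsAsFactors = FALSE
)

data$x <- recode_lookup(data$x, lookup$old, lookup$new)
data$x
```

Because the helper takes plain vectors, it can also be called inside mutate(), e.g. mutate(x = recode_lookup(x, lookup$old, lookup$new)).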
It can also be done with dplyr::recode, as below, converting the lookup table to a named lookup vector. I prefer lookup as a data frame, but I'm okay with the named vector solution. My concern here is that recode is in the Questioning lifecycle phase, so I'm worried that this method isn't stable.
    lookup_v = pull(lookup, new) %>% setNames(lookup$old)
    data %>%
      mutate(x = recode(x, !!!lookup_v))
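If the stability of recode is the worry, the named vector can also do the lookup through ordinary base R subsetting. This is only a sketch under the toy data above, not a claim about what dplyr recommends; it builds lookup_v with setNames() directly rather than pull().

```r
data <- data.frame(
  id = 1:7,
  x = c("A", "A", "B", "C", "D", "AA", "."),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  old = c("A", "D", "."),
  new = c("a", "d", "!"),
  stringsAsFactors = FALSE
)

# Named lookup vector: names are the old values, elements are the new ones
lookup_v <- setNames(lookup$new, lookup$old)

# Only touch the rows whose value has an entry in the lookup
hit <- data$x %in% names(lookup_v)
data$x[hit] <- unname(lookup_v[data$x[hit]])
data$x
```

Values without an entry in lookup_v are never indexed, so they pass through unchanged.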
It could also be done with, say, stringr::str_replace, but using regex for whole-string matching isn't efficient. I suppose forcats::fct_recode is a stable version of recode, but I don't want a factor output (though mutate(x = as.character(fct_recode(x, !!!lookup_v))) is perhaps my favorite option so far...).
I had hoped that the new-ish rows_update() family of dplyr functions would work, but it is strict about column names, and I don't think it can update the column it's joining on. (And it's Experimental, so doesn't yet meet my stability requirement.)
Summary of my requirements:
- Works on character class input. Working more generally is a nice-to-have.
- Uses tidyverse packages (though I'd also be interested in seeing a data.table solution)

A direct data.table solution, without %in%.
Depending on the size of the lookup and data tables, adding keys could improve performance substantially, though it makes no difference for this simple example.
    library(data.table)
    setDT(data)
    setDT(lookup)

    ## If needed
    # setkey(data,x)
    # setkey(lookup,old)

    data[lookup, x:=new, on=.(x=old)]
    data
    #    id  x
    # 1:  1  a
    # 2:  2  a
    # 3:  3  B
    # 4:  4  C
    # 5:  5  d
    # 6:  6 AA
    # 7:  7  !
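Note that setDT() and := both work by reference, so the code above changes data itself. If the original data frame should stay untouched, a variant along the following lines may help. It is a sketch that relies on as.data.table() returning a copy, as its documentation describes; the name dt is only illustrative.

```r
library(data.table)

data <- data.frame(
  id = 1:7,
  x = c("A", "A", "B", "C", "D", "AA", "."),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  old = c("A", "D", "."),
  new = c("a", "d", "!"),
  stringsAsFactors = FALSE
)

# Work on a data.table copy so the original data frame is not modified
dt <- as.data.table(data)

# Update join: for rows of dt whose x matches lookup$old, set x to lookup$new
dt[as.data.table(lookup), x := new, on = .(x = old)]

dt$x     # recoded values
data$x   # original values, unchanged
```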