I am trying to transition some plyr code to dplyr, and getting stuck with the new functionality of rename() in dplyr. I'd like to be able to reuse a single rename() expression for a set of datasets with overlapping but not identical original names. For example,
sample1 <- data.frame(A=1:10, B=letters[1:10])
sample2 <- data.frame(B=11:20, C=letters[11:20])
And then,
rename(sample1, var1 = A, var2 = B, var3 = C)
I would like the result to be that variable A is renamed var1, and B is renamed var2, not adding a var3 in this case. Instead, I get
Error: Unknown variables: C.
In contrast, the plyr syntax would let me use
rename(sample1, c("A" = "var1", "B" = "var2", "C" = "var3"))
rename(sample2, c("A" = "var1", "B" = "var2", "C" = "var3"))
and not throw an error. Is there a way to get the same result in dplyr without getting the Unknown variables error?
I'll just say it once more: if you need to rename variables in R, just use the rename() function.
There may be occasions in which you want to change some of the variable names in your SAS data set. To do so, you'll want to use the RENAME= option. As its name suggests, the RENAME= option allows you to change the variable names within a SAS data set. RENAME = (old1=new1 old2=new2 ....
Method 1: using colnames() method colnames() method in R is used to rename and replace the column names of the data frame in R. The columns of the data frame can be renamed by specifying the new column names as a vector. The new name replaces the corresponding old name of the column in the data frame.
Completely ignoring your actual request on how to do this with dplyr, I would like suggest a different approach using a lookup table:
sample1 <- data.frame(A=1:10, B=letters[1:10])
sample2 <- data.frame(B=11:20, C=letters[11:20])
rename_map <- c("A"="var1",
"B"="var2",
"C"="var3")
names(sample1) <- rename_map[names(sample1)]
str(sample1)
names(sample2) <- rename_map[names(sample2)]
str(sample2)
Fundamentally the algorithm is simple:
EDIT: As per Hadley's suggestion, I used a named vector instead of a list, makes life much easier. I always forget about named vectors :(
#no need to use rename
oldnames<-unique(c(names(sample1),names(sample2)))
newnames<-c("var1","var2","var3")
name_df<-data.frame(oldnames,newnames)
mydata<-list(sample1,sample2) # combined two datasets as a list
#one liner
finaldata <- lapply(mydata, function(i) {colnames(i)<-name_df[name_df[,1] %in% colnames(i),2]
return(i)})
> finaldata
[[1]]
var1 var2
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
[[2]]
var2 var3
1 11 k
2 12 l
3 13 m
4 14 n
5 15 o
6 16 p
7 17 q
8 18 r
9 19 s
10 20 t
With dplyr
, we can use a named vector with old names as values and new names as names, then unquote only the values in name_vec
that matches names in your dataset. rename
supports unquoting characters, so there is no need to convert them to sym
beforehand:
library(dplyr)
name_vec <- c(var1 = "A", var2 = "B", var3 = "C")
sample1 %>%
rename(!!name_vec[name_vec %in% names(.)])
sample2 %>%
rename(!!name_vec[name_vec %in% names(.)])
Also, with setNames
:
name_vec <- c(A = "var1", B = "var2", C = "var3")
sample1 %>%
setNames(name_vec[names(.)])
sample2 %>%
setNames(name_vec[names(.)])
Output:
var1 var2
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
var2 var3
1 11 k
2 12 l
3 13 m
4 14 n
5 15 o
6 16 p
7 17 q
8 18 r
9 19 s
10 20 t
I’ve used @earino’s answer before
myself, but discovered that it can be unsafe. If column names of the data
frame are missing in the (names of the) named vector, those column names are silently replaced with NA
and that is certainly not what you want.
d1 <- data.frame(A = 1:10, B = letters[1:10], stringsAsFactors = FALSE)
rename_vec <- c("B" = "var2", "C" = "var3")
names(d1) <- rename_vec[names(d1)]
str(d1)
#> 'data.frame': 10 obs. of 2 variables:
#> $ NA : int 1 2 3 4 5 6 7 8 9 10
#> $ var2: chr "a" "b" "c" "d" ...
The same can happen, if you run names(d1) <- rename_vec[names(d1)]
twice by accident, because when you run it the second time, none of the
colnames(d1)
are in names(rename_vec)
.
names(d1) <- rename_vec[names(d1)]
str(d1)
#> 'data.frame': 10 obs. of 2 variables:
#> $ NA: int 1 2 3 4 5 6 7 8 9 10
#> $ NA: chr "a" "b" "c" "d" ...
We just need to select those columns that are in the data frame and in the rename vector.
d2 <- data.frame(B1 = 1:10, B = letters[1:10], stringsAsFactors = FALSE)
sel <- is.element(colnames(d2), names(rename_vec))
names(d2)[sel] <- rename_vec[names(d2)][sel]
str(d2)
#> 'data.frame': 10 obs. of 2 variables:
#> $ B1 : int 1 2 3 4 5 6 7 8 9 10
#> $ var2: chr "a" "b" "c" "d" ...
UPDATE: I initially had a solution here that involved string replacement, which turned out to be unsafe as well, because it allowed for partial matching. This one is better, I think.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With