Recoding is a common task with survey data, but the most obvious approaches take more time than they should.
The fastest code that accomplishes the same task on the sample data below, as measured by system.time()
on my machine, wins.
## Sample data
dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1,2,4,5,3),50000))
dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat)
dat <- as.data.frame(dat)
re.codes <- c("This","That","And","The","Other")
Code to optimize.
for (x in 1:ncol(dat)) {
  dat[, x] <- factor(dat[, x], labels = re.codes)
}
Current system.time():
user system elapsed
4.40 0.10 4.49
Hint: dat <- lapply(1:ncol(dat), function(x) factor(dat[,x], labels=re.codes))
is not any faster.
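For anyone who wants to reproduce the timing without waiting on the full data set, here is a minimal sketch of how the loop can be wrapped and timed. The helper name `recode_loop` and the smaller data frame `small` (1,000 repeats instead of 50,000) are my own assumptions for illustration; the absolute numbers will differ from the ones quoted above.

```r
# Smaller stand-in for the sample data (assumption: 1,000 repeats, 3 columns)
small <- as.data.frame(cbind(rep(1:5, 1000),
                             rep(5:1, 1000),
                             rep(c(1, 2, 4, 5, 3), 1000)))
re.codes <- c("This", "That", "And", "The", "Other")

# Hypothetical helper: the loop from the question, wrapped in a function
recode_loop <- function(d, labels) {
  for (x in seq_len(ncol(d))) {
    d[, x] <- factor(d[, x], labels = labels)
  }
  d
}

# system.time() returns user, system and elapsed seconds
timing <- system.time(res <- recode_loop(small, re.codes))
print(timing[["elapsed"]])
```

Wrapping the loop in a function keeps the original data untouched, so the same input can be timed repeatedly against alternative versions.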
Combining @DWin's answer with my answer from "Most efficient list to data.frame method?":
system.time({
  dat3 <- list()
  # define attributes once outside of loop
  attrib <- list(class="factor", levels=re.codes)
  for (i in names(dat)) {              # loop over each column in 'dat'
    dat3[[i]] <- as.integer(dat[[i]])  # convert column to integer
    attributes(dat3[[i]]) <- attrib    # assign factor attributes
  }
  # convert 'dat3' into a data.frame. We can do it like this because:
  # 1) we know 'dat' and 'dat3' have the same number of rows and columns
  # 2) we want 'dat3' to have the same colnames as 'dat'
  # 3) we don't care if 'dat3' has different rownames than 'dat'
  attributes(dat3) <- list(row.names=c(NA_integer_,nrow(dat)),
    class="data.frame", names=names(dat))
})
identical(dat2, dat3) # 'dat2' is from @DWin's answer
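The attribute trick above relies on the values already being the integer codes 1 to 5, so that code k maps straight to re.codes[k]. Here is a small sketch of that equivalence on a single vector. The vectors `x` and `y` are made-up examples, not part of the sample data.

```r
re.codes <- c("This", "That", "And", "The", "Other")

# Made-up vector in which every code 1..5 appears at least once
x <- c(1, 2, 4, 5, 3, 5, 1)

# Route 1: let factor() derive the levels from the data
via_factor <- factor(x, labels = re.codes)

# Route 2: treat the values as integer codes and attach factor attributes
via_attr <- as.integer(x)
attributes(via_attr) <- list(class = "factor", levels = re.codes)

print(identical(via_factor, via_attr))

# If a code is missing from the data, factor(y, labels = re.codes) would
# complain about the number of labels; spelling out the levels avoids that
# and agrees with the attribute route.
y <- c(1, 2, 5)
via_levels_y <- factor(y, levels = 1:5, labels = re.codes)
via_attr_y <- as.integer(y)
attributes(via_attr_y) <- list(class = "factor", levels = re.codes)
print(as.character(via_attr_y))
```

The shortcut is only safe when the column is known to hold whole numbers in 1..length(re.codes); anything else (0, 6, 2.5) would give an invalid or misleading factor, because no validation happens when attributes are assigned directly.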