I have a data set that contains some logical columns and would like to replace values that are 'TRUE' with the corresponding column name. I asked a similar question here and was able to identify an appropriate solution with the help of some suggestions from other S/O users. However, the solution does not use data.table syntax and copies the whole dataset instead of replacing by reference, which is time consuming.
What is the most appropriate way to do this using data.table syntax?
I tried this:
# Load library
library(data.table)
# Create dummy data.table:
mydt <- data.table(id = c(1,2,3,4,5),
ptname = c("jack", "jill", "jo", "frankie", "claire"),
sex = c("m", "f", "f", "m", "f"), apple = c(T,F,F,T,T),
orange = c(F,T,F,T,F),
pear = c(T,T,T,T,F))
# View dummy data:
> mydt
id ptname sex apple orange pear
1: 1 jack m TRUE FALSE TRUE
2: 2 jill f FALSE TRUE TRUE
3: 3 jo f FALSE FALSE TRUE
4: 4 frankie m TRUE TRUE TRUE
5: 5 claire f TRUE FALSE FALSE
# Function to recode values in a data.table:
recode.multi <- function(datacol, oldval, newval) {
trans <- setNames(newval, oldval)
trans[ match(datacol, names(trans)) ]
}
# Get a list of all the logical columns in the data set:
logicalcols <- names(which(mydt[, sapply(mydt, is.logical)] == TRUE))
# Apply the function to convert 'TRUE' to the relevant column names:
mydt[, (logicalcols) := lapply(.SD, recode.multi,
oldval = c(FALSE, TRUE),
newval = c("FALSE", names(.SD))), .SDcols = logicalcols]
# View the result:
> mydt
id ptname sex apple orange pear
1: 1 jack m apple FALSE apple
2: 2 jill f FALSE apple apple
3: 3 jo f FALSE FALSE apple
4: 4 frankie m apple apple apple
5: 5 claire f apple FALSE FALSE
This isn't correct as instead of iterating through each column name for the replacement values, it just recycles the first one ("apple" in this case).
Moreover, if I reverse the order of old and new values, the function ignores my character string replacement for the second value and uses the first two column names as replacements in all cases:
# Apply the function with order of old and new values reversed:
mydt[, (logicalcols) := lapply(.SD, recode.multi,
oldval = c(TRUE, FALSE),
newval = c(names(.SD), "FALSE")), .SDcols = logicalcols]
# View the result:
> mydt
id ptname sex apple orange pear
1: 1 jack m apple orange apple
2: 2 jill f orange apple apple
3: 3 jo f orange orange apple
4: 4 frankie m apple apple apple
5: 5 claire f apple orange orange
I'm sure I'm probably missing something simple but does anyone know why the function does not iterate through the column names (and how to edit it to do this)?
My expected output would be as follows:
> mydt
id ptname sex apple orange pear
1: 1 jack m apple FALSE pear
2: 2 jill f FALSE orange pear
3: 3 jo f FALSE FALSE pear
4: 4 frankie m apple orange pear
5: 5 claire f apple FALSE FALSE
Alternatively any other suggestions of concise data.table syntax to achieve this would be much appreciated.
We can use a melt/dcast
approach
dcast(melt(mydt, id.var = c("id", "ptname", "sex"))[,
value1 := as.character(value)][(value), value1 := variable],
id + ptname + sex~variable, value.var = "value1")
# id ptname sex apple orange pear
#1: 1 jack m apple FALSE pear
#2: 2 jill f FALSE orange pear
#3: 3 jo f FALSE FALSE pear
#4: 4 frankie m apple orange pear
#5: 5 claire f apple FALSE FALSE
Or another option is with set
which would be more efficient
nm1 <- which(unlist(mydt[, lapply(.SD, class)])=="logical")
for(j in nm1){
i1 <- which(mydt[[j]])
set(mydt, i=NULL, j=j, value = as.character(mydt[[j]]))
set(mydt, i = i1, j=j, value = names(mydt)[j])
}
mydt
# id ptname sex apple orange pear
#1: 1 jack m apple FALSE pear
#2: 2 jill f FALSE orange pear
#3: 3 jo f FALSE FALSE pear
#4: 4 frankie m apple orange pear
#5: 5 claire f apple FALSE FALSE
Or another option mentioned in the comments is
mydt[, (nm1) := Map(function(x,y) replace(x, x, y), .SD, names(mydt)[nm1]), .SDcols = nm1]
mydt
# id ptname sex apple orange pear
#1: 1 jack m apple FALSE pear
#2: 2 jill f FALSE orange pear
#3: 3 jo f FALSE FALSE pear
#4: 4 frankie m apple orange pear
#5: 5 claire f apple FALSE FALSE
UPDATE: Comparing options two and three (one is not possible due to the number of non-logical columns) with a data set comprising 18573 rows and 650 columns, of which 252 columns are logical runs with the following timings:
# Option 2:
nm1 <- which(unlist(mydt[, lapply(.SD, is.logical)]))
system.time(
for(j in nm1){
i1 <- which(mydt[[j]])
set(mydt, i=NULL, j=j, value = as.character(mydt[[j]]))
set(mydt, i = i1, j=j, value = names(mydt)[j])
}
)
# user system elapsed
# 0.61 0.00 0.61
# Option 3:
system.time(
mydt[, (nm1) := Map(function(x,y) replace(x, x, y), .SD, names(mydt)[nm1]), .SDcols = nm1]
)
#user system elapsed
#0.65 0.00 0.66
Both are significantly faster than the original approach not using data.table syntax:
# Original approach:
logitrue <- which(mydt == TRUE, arr.ind = T)
system.time(
mydt[logitrue, ] <- colnames(mydt)[logitrue[,2]]
)
# user system elapsed
# 1.22 0.03 4.22
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With