What is an efficient way (any solution including non-base packages welcomed) to collapse dummy variables back into a factor.
race.White race.Hispanic race.Black race.Asian
1 1 0 0 0
2 0 0 0 1
3 1 0 0 0
4 0 0 1 0
5 0 0 0 1
6 0 1 0 0
7 1 0 0 0
8 1 0 0 0
9 1 0 0 0
10 0 0 1 0
Desired output:
race
1 White
2 Asian
3 White
4 Black
5 Asian
6 Hispanic
7 White
8 White
9 White
10 Black
Data:
dat <- structure(list(race.White = c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L,
1L, 0L), race.Hispanic = c(0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
0L), race.Black = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L),
race.Asian = c(0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("race.White",
"race.Hispanic", "race.Black", "race.Asian"), row.names = c(NA,
-10L), class = "data.frame")
What I tried:
This is a possible solution but I am sure there's a better indexing/dplyr/data.table/.etc solution.
apply(dat, 1, function(x) sub("[^.]+\\.", "", colnames(dat))[x])
Indeed, a dummy variable can take values either 1 or 0. It can express either a binary variable (for instance, man/woman, and it's on you to decide which gender you encode to be 1 and which to be 0), or a categorical variables (for instance, level of education: basic/college/postgraduate).
A dummy variable is a numerical variable used in regression analysis to represent subgroups of the sample in your study. In research design, a dummy variable is often used to distinguish different treatment groups.
Yes, coefficients of dummy variables can be more than one or less than zero. Remember that you can interpret that coefficient as the mean change in your response (dependent) variable when the dummy changes from 0 to 1, holding all other variables constant (i.e. ceteris paribus).
Reference category in Dummy variableThe category that has the value of 0 is called the reference category, Benchmark or comparison category. So if there are many dummy variables, we must not forgot to keep an account of the reference category of each of the dummy variable during the interpretation.
We can use max.col
to get the column index, subset the column names based on that and use sub
to remove the prefix.
sub('[^.]+\\.', '', names(dat)[max.col(dat)])
#[1] "White" "Asian" "White" "Black" "Asian" "Hispanic"
#[7] "White" "White" "White" "Black"
Here, I assumed that there is a single 1
per each row. If there are multiple 1s, we can use the option ties.method='first'
or ties.method='last'
.
Or another option is doing the %*%
with the sequence of columns, subset the column names, and remove the prefix with sub
.
sub('[^.]+\\.', '', names(dat)[(as.matrix(dat) %*%seq_along(dat))[,1]])
Or we can use pmax
sub('[^.]+\\.', '', names(dat)[do.call(pmax,dat*seq_along(dat)[col(dat)])])
Another idea:
ff = function(x)
{
ans = integer(nrow(x))
for(i in seq_along(x)) ans[as.logical(x[[i]])] = i
names(x)[ans]
}
sub("[^.]+\\.", "", ff(dat))
#[1] "White" "Asian" "White" "Black" "Asian" "Hispanic" "White" "White" "White" "Black"
And to compare with akrun's alternatives:
akrun1 = function(x) names(x)[max.col(x, "first")]
akrun2 = function(x) names(x)[(as.matrix(x) %*% seq_along(x))[, 1]]
akrun3 = function(x) names(x)[do.call(pmax, x * seq_along(x)[col(x)])]
akrunlike = function(x) names(x)[do.call(pmax, Map("*", x, seq_along(x)))]
DF = setNames(as.data.frame("[<-"(matrix(0L, 1e4, 1e3),
cbind(seq_len(1e4), sample(1e3, 1e4, TRUE)),
1L)),
paste("fac", 1:1e3, sep = ""))
identical(ff(DF), akrun1(DF))
#[1] TRUE
identical(ff(DF), akrun2(DF))
#[1] TRUE
identical(ff(DF), akrun3(DF))
#[1] TRUE
identical(ff(DF), akrunlike(DF))
#[1] TRUE
microbenchmark::microbenchmark(ff(DF), akrun1(DF), akrun2(DF),
akrun3(DF), akrunlike(DF),
as.matrix(DF), col(DF), times = 30)
#Unit: milliseconds
# expr min lq median uq max neval
# ff(DF) 61.99124 64.56194 78.62267 102.18424 152.64891 30
# akrun1(DF) 296.89042 314.28641 327.95059 353.46185 394.46013 30
# akrun2(DF) 103.76105 114.01497 120.12191 129.86513 166.13266 30
# akrun3(DF) 1141.46478 1163.96842 1178.92961 1203.83848 1231.70346 30
# akrunlike(DF) 125.47542 130.20826 141.66123 157.92743 203.42331 30
# as.matrix(DF) 19.46940 20.54543 28.22377 35.69575 87.06001 30
# col(DF) 103.61454 112.75450 116.00120 126.09138 176.97435 30
I included as.matrix()
and col()
just to show that "list"-y structures can be convenient on efficient looping as is. E.g., in contrast to a by-row looping, a way to use by-column looping doesn't need time to transform the structure of data.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With