
Transforming Data Frame in R

Tags: r

I have a data frame with multiple variables, each of which has multiple categories. I'd like to take each category and convert it to an indicator variable.

V1 V2 V3 V4
xc ab ty ky
xc ab ty kj
xc yi tf kj
cv yi tf kj
bg yt tg kl
bg yu yu kl

convert to

xc cv bg .....
T  F  F......
T  F  F....
T  F  F....
F  T  F....
F  F  T...
F  F  T....

I tried

newframe <- transform(oldframe, xc = to_column(oldframe$V1,'xc')) 

where to_column is

to_column = function(col, val){
    if (col == val)
        'TRUE'  else
        'FALSE' }
kogilvie asked Mar 30 '11 19:03


4 Answers

This is one standard approach to creating dummy variables from a categorical variable:

model.matrix( ~ V1 - 1, data=df)

Here df is your data.frame as shown in your question. This returns a 0/1 matrix, where 1 corresponds to your TRUE and 0 to your FALSE. Hope that helps!
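For anyone who wants to run this end to end, here is a minimal sketch. It rebuilds the question's data by hand and assumes the columns are factors (hence `stringsAsFactors = TRUE`); the name `dummies` is only illustrative and is not part of the original answer.

```r
# Rebuild the example data from the question. Treating the columns as
# factors is an assumption; it matches what read.table did by default
# when the question was asked.
df <- data.frame(
  V1 = c("xc", "xc", "xc", "cv", "bg", "bg"),
  V2 = c("ab", "ab", "yi", "yi", "yt", "yu"),
  V3 = c("ty", "ty", "tf", "tf", "tg", "yu"),
  V4 = c("ky", "kj", "kj", "kj", "kl", "kl"),
  stringsAsFactors = TRUE
)

# One 0/1 column per level of V1. The "- 1" drops the intercept, so no
# level is absorbed into a baseline and every level gets its own column.
dummies <- model.matrix(~ V1 - 1, data = df)

colnames(dummies)  # the variable name is prepended to each level name
dummies == 1       # the same indicators expressed as TRUE/FALSE
```

Note that the column names come back prefixed with the variable name (for example `V1bg`), so they may need renaming if only the bare level names are wanted.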

Best regards,

Jay

Jay answered Nov 02 '22 19:11


Building on @Jay's answer, here is how to get the result as a logical matrix.

Logical matrix version:

out <- model.matrix( ~ V1 - 1, data=dat)
out <- matrix(as.logical(out), ncol = ncol(out))
colnames(out) <- with(dat, levels(V1))

> out
        bg    cv    xc
[1,] FALSE FALSE  TRUE
[2,] FALSE FALSE  TRUE
[3,] FALSE FALSE  TRUE
[4,] FALSE  TRUE FALSE
[5,]  TRUE FALSE FALSE
[6,]  TRUE FALSE FALSE

All variables at once version:

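# lapply() always returns a list; sapply() would simplify to a matrix
# (and break do.call) if every variable had the same number of levels
out2 <- lapply(dat, function(x) model.matrix( ~ x - 1))
out2 <- do.call(cbind, out2)
out2 <- matrix(as.logical(out2), ncol = ncol(out2))
colnames(out2) <- unlist(lapply(dat, levels))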

> out2
        bg    cv    xc    ab    yi    yt    yu    tf    tg    ty
[1,] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE
[2,] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE
[3,] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
[4,] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
[5,]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
[6,]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
        yu    kj    kl    ky
[1,] FALSE FALSE FALSE  TRUE
[2,] FALSE  TRUE FALSE FALSE
[3,] FALSE  TRUE FALSE FALSE
[4,] FALSE  TRUE FALSE FALSE
[5,] FALSE FALSE  TRUE FALSE
[6,]  TRUE FALSE  TRUE FALSE

If you don't want this as a full matrix like above, then you can stop after the first line, which gives all the model matrices in a list, one for each variable (column) in dat, and convert each of them to a logical matrix. This one-liner does both steps:

> lapply(lapply(dat, function(x) model.matrix( ~ x - 1)),
+        function(x) matrix(as.logical(x), ncol = ncol(x)))
$V1
      [,1]  [,2]  [,3]
[1,] FALSE FALSE  TRUE
[2,] FALSE FALSE  TRUE
[3,] FALSE FALSE  TRUE
[4,] FALSE  TRUE FALSE
[5,]  TRUE FALSE FALSE
[6,]  TRUE FALSE FALSE

$V2
      [,1]  [,2]  [,3]  [,4]
[1,]  TRUE FALSE FALSE FALSE
[2,]  TRUE FALSE FALSE FALSE
[3,] FALSE  TRUE FALSE FALSE
[4,] FALSE  TRUE FALSE FALSE
[5,] FALSE FALSE  TRUE FALSE
[6,] FALSE FALSE FALSE  TRUE

$V3
      [,1]  [,2]  [,3]  [,4]
[1,] FALSE FALSE  TRUE FALSE
[2,] FALSE FALSE  TRUE FALSE
[3,]  TRUE FALSE FALSE FALSE
[4,]  TRUE FALSE FALSE FALSE
[5,] FALSE  TRUE FALSE FALSE
[6,] FALSE FALSE FALSE  TRUE

$V4
      [,1]  [,2]  [,3]
[1,] FALSE FALSE  TRUE
[2,]  TRUE FALSE FALSE
[3,]  TRUE FALSE FALSE
[4,]  TRUE FALSE FALSE
[5,] FALSE  TRUE FALSE
[6,] FALSE  TRUE FALSE

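And if the level names are important as column names, then we can modify this to

foo <- function(x) {
    mat <- matrix(as.logical(x), ncol = ncol(x))
    # x is the model matrix here, so levels(x) would be NULL; recover the
    # level names from its column names ("xbg", "xcv", ...) instead
    colnames(mat) <- sub("^x", "", colnames(x))
    mat
}
lapply(lapply(dat, function(x) model.matrix( ~ x - 1)), foo)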
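For comparison, the asker's original to_column idea can also be made to work without model.matrix. The sketch below is not from any of the answers: `to_columns` and `all_vars` are hypothetical names, and the data is rebuilt by hand from the question. The key point is that `==` is already vectorized over a whole column, whereas `if ()` expects a single TRUE/FALSE, which is why the attempt in the question does not do what was intended.

```r
# Example data from the question, rebuilt by hand (assumed to be factors).
dat <- data.frame(
  V1 = c("xc", "xc", "xc", "cv", "bg", "bg"),
  V2 = c("ab", "ab", "yi", "yi", "yt", "yu"),
  V3 = c("ty", "ty", "tf", "tf", "tg", "yu"),
  V4 = c("ky", "kj", "kj", "kj", "kl", "kl"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: compare the whole column against each of its levels.
# Each comparison yields one logical vector; sapply() binds them into a
# matrix whose columns are named after the levels.
to_columns <- function(col) {
  col <- factor(col)
  sapply(levels(col), function(val) col == val)
}

# Indicators for a single variable.
to_columns(dat$V1)

# Indicators for every variable, placed side by side in one logical matrix.
all_vars <- do.call(cbind, lapply(dat, to_columns))
```

Because "yu" is a level of both V2 and V3, `all_vars` ends up with two columns called "yu", just like the out2 matrix above.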
Gavin Simpson answered Nov 02 '22 20:11


You could have a look at the reshape package; it provides functionality to pivot data like this. There are examples of its use at the author's homepage.

John answered Nov 02 '22 19:11


This is quite straightforward with mtabulate from the "qdap" package:

library(qdap)
mtabulate(split(mydf, 1:nrow(mydf))) > 0
#      ab    bg    cv    kj    kl    ky    tf    tg    ty    xc    yi
# 1  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
# 2  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
# 3 FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
# 4 FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
# 5 FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
# 6 FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#      yt    yu
# 1 FALSE FALSE
# 2 FALSE FALSE
# 3 FALSE FALSE
# 4 FALSE FALSE
# 5  TRUE FALSE
# 6 FALSE  TRUE

By default, mtabulate tabulates the results (surprise!), so the result is a numeric data.frame of counts. You'll see, for instance, that the count of "yu" in row 6 is actually 2. To get the logical output you want (just presence/absence), compare the values obtained from mtabulate with zero.
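If installing qdap is not an option, the same count-then-compare idea can be sketched in base R with table(). This swaps in a different technique from the one in this answer; the names `m`, `counts` and `present` are illustrative only, and the data is rebuilt by hand from the question.

```r
# Example data from the question, rebuilt by hand.
mydf <- data.frame(
  V1 = c("xc", "xc", "xc", "cv", "bg", "bg"),
  V2 = c("ab", "ab", "yi", "yi", "yt", "yu"),
  V3 = c("ty", "ty", "tf", "tf", "tg", "yu"),
  V4 = c("ky", "kj", "kj", "kj", "kl", "kl")
)

# Flatten to a character matrix and cross-tabulate row index against value.
# The result has one row per observation and one column per distinct value,
# regardless of which variable the value came from.
m <- as.matrix(mydf)
counts <- table(row = row(m), value = m)

counts["6", "yu"]      # 2: "yu" appears in both V2 and V3 of row 6

# Presence/absence: compare the counts with zero.
present <- counts > 0
```

As with mtabulate, a value that appears in several variables (here "yu") gets a single column, which is a different layout from the one-block-per-variable output of the model.matrix answers.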

A5C1D2H2I1M1N2O1R2T1 answered Nov 02 '22 19:11