Handling missing combinations of factors in R

Tags: r

So, I have a data frame with two factors and one numeric variable like so:

> D
f1 f2 v1
 1  A 23
 2  A 45
 2  B 27
     .
     .
     .

So the levels of f1 are 1 and 2, and the levels of f2 are A and B. The problem is that no value was entered for f1 = 1 and f2 = B (that is, D$v1[D$f1 == 1 & D$f2 == "B"] is empty); in reality this value should be zero.

In my actual data frame I have 11 levels of f1 and close to 150 levels of f2 and I need to create an observation with v1=0 for every combination of f1 and f2 that is missing from my data frame.

How would I go about doing this?

Thanks in advance,

Ian

asked Jun 08 '12 18:06 by user1443010


3 Answers

Here is a tidyr solution: spread to wide format with fill = 0, then gather back to long format (df is the data frame from the question).

library(tidyr)
df %>% spread(f2, v1, fill=0) %>% gather(f2, v1, -f1)

#  f1 f2 v1
#1  1  A 23
#2  2  A 45
#3  1  B  0
#4  2  B 27

You could equally do df %>% spread(f1, v1, fill=0) %>% gather(f1, v1, -f2).
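
In current tidyr releases spread() and gather() are superseded, and the same package provides complete(), which fills in missing combinations directly. A minimal sketch, assuming the toy data from the question and that both grouping columns are factors (the name `filled` is just for illustration):

```r
library(tidyr)

# Toy data from the question; f1 and f2 are factors.
df <- data.frame(f1 = factor(c(1, 2, 2)),
                 f2 = factor(c("A", "A", "B")),
                 v1 = c(23, 45, 27))

# complete() adds a row for every f1 x f2 combination that is absent;
# fill = list(v1 = 0) puts 0 in v1 for those new rows instead of NA.
filled <- complete(df, f1, f2, fill = list(v1 = 0))
filled
```

This avoids the round trip through wide format, which may matter with around 150 levels of f2.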

answered Oct 28 '22 00:10 by Joe


Using your data:

dat <- data.frame(f1 = factor(c(1,2,2)), f2 = factor(c("A","A","B")),
                  v1 = c(23,45,27))

one option is to create a lookup table containing every combination of the factor levels. The expand.grid() function builds it from the levels of both factors:

dat2 <- with(dat, expand.grid(f1 = levels(f1), f2 = levels(f2)))
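
To make the lookup table concrete, here is what it looks like for the toy data. This is only an illustration of the step above; the row-count check at the end is my own addition, not part of the original approach:

```r
# Toy data from the question.
dat <- data.frame(f1 = factor(c(1, 2, 2)), f2 = factor(c("A", "A", "B")),
                  v1 = c(23, 45, 27))

# One row per combination of levels; the first argument varies fastest.
dat2 <- with(dat, expand.grid(f1 = levels(f1), f2 = levels(f2)))
dat2
#   f1 f2
# 1  1  A
# 2  2  A
# 3  1  B
# 4  2  B

# The table always has nlevels(f1) * nlevels(f2) rows.
n_combos <- nlevels(dat$f1) * nlevels(dat$f2)
nrow(dat2) == n_combos
```

With 11 levels of f1 and 150 levels of f2, the lookup table would have 11 * 150 = 1650 rows.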

A database-like join can then be performed with the merge() function, specifying that all rows of the lookup table are kept in the result (all.y = TRUE):

newdat <- merge(dat, dat2, all.y = TRUE)

The above line produces:

> newdat
  f1 f2 v1
1  1  A 23
2  1  B NA
3  2  A 45
4  2  B 27

As you can see, the missing combinations are given the value NA, indicating that they are missing. It is relatively simple to then replace these NAs with 0s:

> newdat$v1[is.na(newdat$v1)] <- 0
> newdat
  f1 f2 v1
1  1  A 23
2  1  B  0
3  2  A 45
4  2  B 27
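
If you need this for several data frames, the three steps above can be wrapped in a small function. This is a sketch only: fill_missing() and its argument names are hypothetical, and it assumes both grouping columns are factors. Note that it also overwrites any NA already present in the value column.

```r
# Hypothetical helper: add a row for every missing combination of two
# factor columns and set the value column to `fill` in those rows.
fill_missing <- function(dat, f1 = "f1", f2 = "f2", value = "v1", fill = 0) {
  # Lookup table with every combination of the two factors' levels.
  grid <- expand.grid(levels(dat[[f1]]), levels(dat[[f2]]))
  names(grid) <- c(f1, f2)
  # Keep every row of the lookup table; unmatched rows get NA in `value`.
  out <- merge(dat, grid, by = c(f1, f2), all.y = TRUE)
  # Replace NA with the fill value (this includes pre-existing NAs).
  out[[value]][is.na(out[[value]])] <- fill
  out
}

dat <- data.frame(f1 = factor(c(1, 2, 2)), f2 = factor(c("A", "A", "B")),
                  v1 = c(23, 45, 27))
res <- fill_missing(dat)
res
```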

answered Oct 28 '22 01:10 by Gavin Simpson


Two years late, but I had the same problem and came up with this plyr solution:

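library(plyr)

dat <- data.frame(f1 = factor(c(1,2,2)), f2 = factor(c("A","A","B")), v1 = c(23,45,27))

# .drop = FALSE keeps empty f1/f2 combinations; the function returns 0 for those
newdat <- ddply(dat, .(f1, f2), numcolwise(function(x) if (length(x) > 0) x else 0), .drop = FALSE)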

> newdat
  f1 f2 v1
1  1  A 23
2  1  B  0
3  2  A 45
4  2  B 27

answered Oct 28 '22 01:10 by user28400