I'm learning dplyr, having come from plyr, and I want to generate (per group) columns (per interaction) from the output of xtabs.
Short summary: I'm getting
A B
1 NA
NA 2
when I wanted
A B
1 2
xtabs data looks like this:
> xtabs(data=data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T)))
A
P FALSE TRUE
FALSE 1 2
TRUE 1 1
Now do() wants its data in a data frame, like this:
> xtabs(data=data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T))) %>% as.data.frame
P A Freq
1 FALSE FALSE 1
2 TRUE FALSE 1
3 FALSE TRUE 2
4 TRUE TRUE 1
Now I want a single row output with columns being the interaction of levels. Here's what I'm looking for:
FALSE_FALSE TRUE_TRUE FALSE_TRUE TRUE_FALSE
1 1 2 1
But instead I get
> xtabs(data=data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T))) %>%
as.data.frame %>%
unite(S,A,P) %>%
spread(S,Freq)
FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
1 1 NA NA NA
2 NA 1 NA NA
3 NA NA 2 NA
4 NA NA NA 1
I'm clearly misunderstanding something here. I'm looking for the equivalent of reshape2's code here (using magrittr pipes for consistency):
> xtabs(data=data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T))) %>%
as.data.frame %>% # can be omitted. (safely??)
melt %>%
mutate(S=interaction(P,A),value=value) %>%
dcast(NA~S)
Using P, A as id variables
NA FALSE.FALSE TRUE.FALSE FALSE.TRUE TRUE.TRUE
1 NA 1 1 2 1
(note NA is used here because I don't have a grouping variable in this simplified example)
Update: interestingly, adding a single grouping column seems to fix this. Why does it synthesise a grouping column (presumably from the row names) without me telling it to?
> xtabs(data=data.frame(h="foo",P=c(F,T,F,T,F),A=c(F,F,T,T,T))) %>%
as.data.frame %>%
unite(S,A,P) %>%
spread(S,Freq)
h FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
1 foo 1 1 2 1
This seems like a partial solution.
The key here is that spread doesn't aggregate the data. Hence, if you hadn't already used xtabs to aggregate first, you would be doing this:
a <- data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T), Freq = 1) %>%
unite(S,A,P)
a
## S Freq
## 1 FALSE_FALSE 1
## 2 FALSE_TRUE 1
## 3 TRUE_FALSE 1
## 4 TRUE_TRUE 1
## 5 TRUE_FALSE 1
a %>% spread(S, Freq)
## FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
## 1 1 NA NA NA
## 2 NA 1 NA NA
## 3 NA NA 1 NA
## 4 NA NA NA 1
## 5 NA NA 1 NA
Without aggregation, no other result would make sense.
This is predictable based on the help file for the fill parameter:
If there isn't a value for every combination of the other variables and the key column, this value will be substituted.
In your case, there aren't any other variables to combine with the key column. Had there been, then...
b <- data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T), Freq = 1
, h = rep(c("foo", "bar"), length.out = 5)) %>%
unite(S,A,P)
b
## S Freq h
## 1 FALSE_FALSE 1 foo
## 2 FALSE_TRUE 1 bar
## 3 TRUE_FALSE 1 foo
## 4 TRUE_TRUE 1 bar
## 5 TRUE_FALSE 1 foo
b %>% spread(S, Freq)
## Error: Duplicate identifiers for rows (3, 5)
...it would fail, because it can't aggregate rows 3 and 5 (because it isn't designed to).
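If you want to see which rows are clashing before calling spread, a quick check like the following might help. This is only a sketch: it rebuilds the same toy data as b, and the name dupes is just something I made up for illustration.

```r
library(dplyr)
library(tidyr)

# Same toy data as `b` above, rebuilt so the snippet stands on its own.
b <- data.frame(P = c(FALSE, TRUE, FALSE, TRUE, FALSE),
                A = c(FALSE, FALSE, TRUE, TRUE, TRUE),
                Freq = 1,
                h = rep(c("foo", "bar"), length.out = 5)) %>%
  unite(S, A, P)

# Count rows per (h, S) combination. Any combination with n > 1 is one
# that spread() cannot place in a single cell without aggregation.
dupes <- b %>%
  count(h, S) %>%
  filter(n > 1)

dupes
```

Here the only clash should be the foo / TRUE_FALSE combination, which corresponds to rows 3 and 5 in the error message.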
The tidyr/dplyr way to do it would be group_by and summarize instead of xtabs, because summarize preserves the grouping column, hence spread can tell which observations belong in the same row:
b %>% group_by(h, S) %>%
summarize(Freq = sum(Freq))
## Source: local data frame [4 x 3]
## Groups: h
##
## h S Freq
## 1 bar FALSE_TRUE 1
## 2 bar TRUE_TRUE 1
## 3 foo FALSE_FALSE 1
## 4 foo TRUE_FALSE 2
b %>% group_by(h, S) %>%
summarize(Freq = sum(Freq)) %>%
spread(S, Freq)
## Source: local data frame [2 x 5]
##
## h FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
## 1 bar NA 1 NA 1
## 2 foo 1 NA 2 NA
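For the original example, which has no grouping variable at all, the same idea can be sketched in two ways. Both are illustrations under assumptions rather than the one right answer: grp is a hypothetical placeholder column I add so that spread has something to identify the row by, and the second route assumes tidyr 1.0.0 or later, where pivot_wider is available and can aggregate through values_fn.

```r
library(dplyr)
library(tidyr)

# The question's data: no grouping variable.
raw <- data.frame(P = c(FALSE, TRUE, FALSE, TRUE, FALSE),
                  A = c(FALSE, FALSE, TRUE, TRUE, TRUE))

# Route 1: group_by()/summarise() with a constant placeholder column
# (`grp` is a hypothetical name), so spread() has a row identifier.
wide_spread <- raw %>%
  mutate(grp = "all") %>%
  unite(S, A, P) %>%
  group_by(grp, S) %>%
  summarise(Freq = n()) %>%
  ungroup() %>%
  spread(S, Freq)

# Route 2 (assumes tidyr >= 1.0.0): pivot_wider() aggregates duplicate
# keys itself when given values_fn, so no placeholder column is needed.
wide_pivot <- raw %>%
  unite(S, A, P) %>%
  mutate(Freq = 1) %>%
  pivot_wider(names_from = S, values_from = Freq,
              values_fn = list(Freq = sum))

wide_spread
wide_pivot
```

Both results should be a single row in which TRUE_FALSE is 2 and the other three combinations are 1. Route 1 keeps the extra grp column, which you can drop with select(-grp) if you don't want it.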