Can dplyr perform chained summarise operations on a data.frame?
My code has the structure:
data_df = tbl_df(data)
data_df %.%
group_by(col_1) %.%
summarise(number_of= length(col_2)) %.%
summarise(sum_of = sum(col_3))
This causes RStudio to encounter a fatal error with an "R Session Aborted" message.
With plyr I would usually include both of these summarise functions without problems.
UPDATE
Data are here.
Code is:
library(dplyr)
orth <- read.csv('orth0106.csv')
orth_df = tbl_df(orth)
orth_df %.%
group_by(Hospital) %.%
summarise(Procs = length(Procedure)) %.%
summarise(SSIs = sum(SSI))
I can reproduce the error on a Windows 7 machine running RStudio 0.97.551.
It may be because you're calling summarise a second time and chaining onto a column that's no longer there. You can summarise two different columns in a single call, as I've done here.
url <- "https://raw.github.com/johnmarquess/some.data/master/orth0106.csv"
library(dplyr)
orth <- read.csv(url)
orth_df <- tbl_df(orth)
orth_df %.%
group_by(Hospital) %.%
summarise(Procs = length(Procedure), SSIs = sum(SSI))
## Source: local data frame [18 x 3]
##
## Hospital Procs SSIs
## 1 A 865 80
## 2 B 1069 38
## 3 C 796 24
## 4 D 891 35
## 5 E 997 39
## 6 F 550 30
## 7 G 2598 128
## 8 H 373 27
## 9 I 1079 70
## 10 J 714 30
## 11 K 477 30
## 12 L 227 2
## 13 M 125 6
## 14 N 589 38
## 15 O 292 3
## 16 P 149 9
## 17 Q 1984 52
## 18 R 351 13
In any event this seems like either an RStudio or a dplyr bug. I'd open up an issue with Hadley as he probably cares either way: https://github.com/hadley/dplyr/issues
EDIT: This (your first call) also causes Rgui (Windows) and the terminal to crash on:
R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)
This indicates a dplyr problem that Hadley and Romain will want to know about.
To illustrate my first point, we run:
orth_df %.%
group_by(Hospital) %.%
summarise(Procs = length(Procedure))
Source: local data frame [18 x 2]
Hospital Procs
1 A 865
2 B 1069
3 C 796
4 D 891
5 E 997
6 F 550
7 G 2598
8 H 373
9 I 1079
10 J 714
11 K 477
12 L 227
13 M 125
14 N 589
15 O 292
16 P 149
17 Q 1984
18 R 351
Where is %.% summarise(SSIs = sum(SSI)) supposed to find SSI?
So the chaining you think is happening fails. To my understanding, %.% is similar to how ggplot2 works but not exactly the same. In ggplot2, once you pass the data in the initial mapping you can access it later on. Here %.% seems to grab the left chunk and operate on it like this:
So you're grabbing:
Hospital Procs
1 A 865
2 B 1069
3 C 796
.
.
.
17 Q 1984
18 R 351
when you use %.% summarise(SSIs = sum(SSI)), and there is no SSI column left to be gotten. So the analogy that comes to mind is wiring Christmas lights in series vs. in parallel: %.% is serial, ggplot() + is parallel. This is a nonprogrammer's understanding of things and the R gurus may come and tell me I'm stupid, but for now that's the best theory you've got.
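The "serial" behaviour can be checked on a small example. This is only a sketch under two assumptions: the `toy` data frame is made up for illustration (it is not the orth0106.csv data), and the code uses `%>%`, the pipe that later dplyr releases adopted in place of `%.%`. It shows that the result of the first summarise no longer contains SSI, and that asking for both summaries in one call works.

```r
library(dplyr)

# Hypothetical toy data standing in for orth0106.csv (values are invented)
toy <- data.frame(
  Hospital  = c("A", "A", "B", "B", "B"),
  Procedure = c("hip", "knee", "hip", "hip", "knee"),
  SSI       = c(1, 0, 0, 1, 1)
)

# First summarise: the result keeps only the grouping column and Procs
step1 <- toy %>%
  group_by(Hospital) %>%
  summarise(Procs = length(Procedure))

print(names(step1))             # the columns a second summarise would see
print("SSI" %in% names(step1))  # is SSI still available downstream?

# Both summaries in one call, while every original column is still present
both <- toy %>%
  group_by(Hospital) %>%
  summarise(Procs = length(Procedure), SSIs = sum(SSI))

print(both)
```

A second summarise chained onto `step1` has only Hospital and Procs to work with, which matches the explanation above. In a current dplyr that should surface as an ordinary error rather than a crash, so the session abort in the question looks like a bug in that early version.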