Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spread vs dcast

Tags:

r

tidyr

reshape2

I have a table like this,

> head(dt2)
  Weight Height   Fitted interval limit    value
1   65.6  174.0 71.91200     pred   lwr 53.73165
2   80.7  193.5 91.63237     pred   lwr 73.33198
3   72.6  186.5 84.55326     pred   lwr 66.31751
4   78.8  187.2 85.26117     pred   lwr 67.02004
5   74.8  181.5 79.49675     pred   lwr 61.29244
6   86.4  184.0 82.02501     pred   lwr 63.80652

I want it to have like this,

> head(reshape2::dcast(dt2, 
         Weight + Height + Fitted + interval ~ limit, 
         fun.aggregate = mean))
  Weight Height   Fitted interval      lwr      upr
1   42.0  153.4 51.07920     conf 49.15463 53.00376
2   42.0  153.4 51.07920     pred 32.82122 69.33717
3   43.2  160.0 57.75378     conf 56.35240 59.15516
4   43.2  160.0 57.75378     pred 39.54352 75.96404
5   44.8  149.5 47.13512     conf 44.87642 49.39382
6   44.8  149.5 47.13512     pred 28.83891 65.43133

But using tidyr::spread, How can I do that?

I was using,

> tidyr::spread(dt2, limit, value)

But getting the error,

Error: Duplicate identifiers for rows (1052, 1056), (238, 242), (1209, 1218), (395, 404), (839, 1170), (25, 356), (1173, 1203, 1215), (359, 389, 401), (1001, 1200), (187, 386), (906, 907), (92, 93), (930, 1144), (116, 330), (958, 1171), (144, 357), (902, 1018), (88, 204), (960, 1008), (146, 194), (1459, 1463), (645, 649), (1616, 1625), (802, 811), (1246, 1577), (432, 763), (1580, 1610, 1622), (766, 796, 808), (1408, 1607), (594, 793), (1313, 1314), (499, 500), (1337, 1551), (523, 737), (1365, 1578), (551, 764), (1309, 1425), (495, 611), (1367, 1415), (553, 601)

Random 10 Rows::

> dt[sample(nrow(dt), 10), ]
     Weight Height   Fitted interval limit    value
1253   52.2  162.5 60.28203     conf   upr 61.51087
426    49.1  158.8 56.54022     pred   upr 74.75756
1117   78.4  184.5 82.53066     conf   lwr 80.98778
1171   85.9  166.4 64.22611     conf   lwr 63.21254
948    61.4  177.8 75.75494     conf   lwr 74.66393
384    90.9  172.7 70.59731     pred   lwr 52.41828
289    75.9  172.7 70.59731     pred   lwr 52.41828
3      44.8  149.5 47.13512     pred   lwr 28.83891
774    87.3  182.9 80.91258     pred   upr 99.12445
772    86.4  175.3 73.22669     pred   upr 91.40919
like image 994
TheRimalaya Avatar asked Feb 05 '16 13:02

TheRimalaya


2 Answers

Let's say you were starting with data that looked like this:

mydf
#   Weight Height  Fitted interval limit    value
# 1     42  153.4 51.0792     conf   lwr 49.15463
# 2     42  153.4 51.0792     pred   lwr 32.82122
# 3     42  153.4 51.0792     conf   upr 53.00376
# 4     42  153.4 51.0792     pred   upr 69.33717
# 5     42  153.4 51.0792     conf   lwr 60.00000
# 6     42  153.4 51.0792     pred   lwr 90.00000

Notice the duplication in rows 5 and 6 of the grouping columns (1 to 5). This is essentially what "tidyr" is telling you. The first row and fifth are duplicates, as are the second and sixth.

tidyr::spread(mydf, limit, value)
# Error: Duplicate identifiers for rows (1, 5), (2, 6)

As suggested by @Jaap, the solution is to first "summarise" the data. Since "tidyr" is only for reshaping data (unlike "reshape2", which aggregated and reshaped), you need to perform the aggregation with "dplyr" before you change the data form. Here, I've done that with summarise for the "value" column.

If you stopped the execution at the summarise step, you would find that our original 6-row dataset had "shrunk" to 4 rows. Now, spread would work as expected.

mydf %>% 
  group_by(Weight, Height, Fitted, interval, limit) %>% 
  summarise(value = mean(value)) %>% 
  spread(limit, value)
# Source: local data frame [2 x 6]
# 
#   Weight Height  Fitted interval      lwr      upr
#    (dbl)  (dbl)   (dbl)    (chr)    (dbl)    (dbl)
# 1     42  153.4 51.0792     conf 54.57731 53.00376
# 2     42  153.4 51.0792     pred 61.41061 69.33717

This matches the expected output from dcast with fun.aggregate = mean.

reshape2::dcast(mydf, Weight + Height + Fitted + interval ~ limit, fun.aggregate = mean)
#   Weight Height  Fitted interval      lwr      upr
# 1     42  153.4 51.0792     conf 54.57731 53.00376
# 2     42  153.4 51.0792     pred 61.41061 69.33717

Sample data:

 mydf <- structure(list(Weight = c(42, 42, 42, 42, 42, 42), Height = c(153.4, 
     153.4, 153.4, 153.4, 153.4, 153.4), Fitted = c(51.0792, 51.0792,         
     51.0792, 51.0792, 51.0792, 51.0792), interval = c("conf", "pred",        
     "conf", "pred", "conf", "pred"), limit = structure(c(1L, 1L,             
     2L, 2L, 1L, 1L), .Label = c("lwr", "upr"), class = "factor"),            
         value = c(49.15463, 32.82122, 53.00376, 69.33717, 60,          
         90)), .Names = c("Weight", "Height", "Fitted", "interval",     
     "limit", "value"), row.names = c(NA, 6L), class = "data.frame")   
like image 76
2 revs, 2 users 99% Avatar answered Oct 12 '22 23:10

2 revs, 2 users 99%


Here are data.table alternatives to dplyr. Use mydf from Ananda's answer.

library(data.table)
library(magrittr)
library(tidyr)

DT <- data.table(mydf)

First, you can use by to compute the mean by each limit.

DT[, .(lwr = mean(value[limit == "lwr"]), 
       upr = mean(value[limit == "upr"])), 
   by = .(Weight, Height, Fitted, interval)]

If this limit == ... looks too much hard coding, you can first aggregate into a long format, then spread. This works because once you aggregate, there is no duplicate.

DT[, .(value = mean(value)), by = .(Weight, Height, Fitted, interval, limit)] %>%
  spread(key = "limit", value = "value")

Both gets you

#   Weight Height  Fitted interval      lwr      upr
#1:     42  153.4 51.0792     conf 54.57731 53.00376
#2:     42  153.4 51.0792     pred 61.41061 69.33717
like image 32
Kota Mori Avatar answered Oct 13 '22 01:10

Kota Mori