Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reshape cast compare to one level

Tags:

r

reshape

plyr

I often have data where I want to compare the value of one level of variable with all the other levels of variable. Each time I write code to do this I wish it were easier. Here's an example of the problem:

Suppose I want to compare the average cost of diamonds of any cut with the average cost of best cut diamonds. To make things fair I want to do this for each clarity, separately.

Let's check we have enough data:

> with(diamonds,table(cut,clarity))
           clarity
cut           I1  SI2  SI1  VS2  VS1 VVS2 VVS1   IF
  Fair       210  466  408  261  170   69   17    9
  Good        96 1081 1560  978  648  286  186   71
  Very Good   84 2100 3240 2591 1775 1235  789  268
  Premium    205 2949 3575 3357 1989  870  616  230
  Ideal      146 2598 4282 5071 3589 2606 2047 1212

okay no zeroes in Idea, so let's calculate the mean.

> claritycut<-ddply(diamonds,.(clarity,cut),summarize,price=mean(price))
> claritycut
   clarity       cut    price
1       I1      Fair 3703.533
2       I1      Good 3596.635
3       I1 Very Good 4078.226
4       I1   Premium 3947.332
5       I1     Ideal 4335.726
6      SI2      Fair 5173.916
7      SI2      Good 4580.261
8      SI2 Very Good 4988.688
9      SI2   Premium 5545.937
10     SI2     Ideal 4755.953
...

The end result I want is:

   clarity  variable     ratio
1       I1      Fair 0.8541899
2       I1      Good 0.8295348
3       I1 Very Good 0.9406098
4       I1   Premium 0.9104200
5       I1     Ideal 1.0000000
6      SI2      Fair 1.0878822
7      SI2      Good 0.9630586
8      SI2 Very Good 1.0489356
9      SI2   Premium 1.1661043
10     SI2     Ideal 1.0000000
...

But I'm not sure how to do this neatly. Most of the rest of this question concerns an intermediate step in the calculation - the divide.

Now I want to calculate the relative price of all cuts vs Ideals. Here's the data frame I'd expect to see partway through the calculation - extracting only one level of cut:

> claritycutideal <- join(subset(claritycut,cut!="Ideal"),summarize(subset(claritycut,cut=="Ideal"),Ideal=price,clarity))
> print(claritycutideal)
Joining by: clarity
   clarity       cut    price    Ideal
1       I1      Fair 3703.533 4335.726
2       I1      Good 3596.635 4335.726
3       I1 Very Good 4078.226 4335.726
4       I1   Premium 3947.332 4335.726
5      SI2      Fair 5173.916 4755.953
6      SI2      Good 4580.261 4755.953
7      SI2 Very Good 4988.688 4755.953
8      SI2   Premium 5545.937 4755.953
...

Which works, but it's fiddly to write the above statement, and I still need to finish the calculation off, mentioning the Ideal name again.

> mutate(claritycutideal,ratio=price/Ideal)

It feels like I want something like

> cast(claritycut,clarity~cut)
Using clarity, cut as id variables
  clarity     Fair     Good Very Good  Premium    Ideal
1      I1 3703.533 3596.635  4078.226 3947.332 4335.726
2     SI2 5173.916 4580.261  4988.688 5545.937 4755.953
3     SI1 4208.279 3689.533  3932.391 4455.269 3752.118
4     VS2 4174.724 4262.236  4215.760 4550.331 3284.550
...

This is totally unsuitable for the mean calculation, since I would need to know the names of all the recast levels in the calculation:

I'd like recast, but with a way to filter the levels extracted and leave the rest untouched, for instance:

> cast(claritycut,clarity~cut,subset=cut=="Ideal")

Which exists, but doesn't retain the unfiltered levels.

I would then need to melt it again, and while there's a recast, there's no remelt.

Does anyone have a neat trick to do this?

Or perhaps I'm looking at this completely the wrong way - do marginal calculations do this for me?


The following works exactly right but is fiddly:

> valuevars=function(x)x[!names(x)%in%attr(x,"idvars")]
> melt(ddply(cast(claritycut,clarity~cut),.(clarity),
             function(x)valuevars(x)/x$Ideal))
like image 833
Alex Brown Avatar asked Feb 24 '23 12:02

Alex Brown


1 Answers

I'm not sure this is neat enough, but there is a two liner:

# from your code
claritycut <- ddply(diamonds,.(clarity,cut),summarize,price=mean(price))

# 1 do that work
transform(merge(claritycut, subset(claritycut, cut=="Ideal"), by="clarity"),
  ratio = price.x / price.y)

# 2 another way
ddply(claritycut, .(clarity), 
      function(x) data.frame(cut=x$cut, 
                             rate=x$price / subset(x, cut == "Ideal")$price))

# 3 another way
ddply(claritycut, .(clarity), 
      summarize, cut=cut, rate=price / price[cut == "Ideal"])

and finally 4) here is a one-liner version:

ddply(diamonds, .(clarity), 
      function(x) transform(ddply(x, .(cut), 
                                  summarize, rate=mean(price)), 
                            rate=rate/mean(subset(x, cut=="Ideal")$price)))

but too complicated.

like image 64
kohske Avatar answered Feb 26 '23 22:02

kohske