I was looking at the help page for the aggregate function in R. I had never used this convenience function, but I have a process it should help me speed up. However, I've been totally unable to walk through the example and understand what is going on.
One example is the following:
1> aggregate(state.x77, list(Region = state.region), mean)
         Region Population Income Illiteracy Life Exp Murder HS Grad  Frost   Area
1     Northeast       5495   4570      1.000    71.26  4.722   53.97 132.78  18141
2         South       4208   4012      1.738    69.71 10.581   44.34  64.62  54605
3 North Central       4803   4611      0.700    71.77  5.275   54.52 138.83  62652
4          West       2915   4703      1.023    71.23  7.215   62.00 102.15 134463
The output here is exactly what I would expect, so I tried to understand what is going on. First I looked at state.x77:
1> head(state.x77)
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
OK, that's odd to me. I would expect to see a column in state.x77 named state.region or something. So state.region must be its own object. So I do a str() on it:
1> str(state.region)
Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
It looks like state.region is just a factor. Somehow there HAS to be a connection between state.region and state.x77 in order for aggregate() to group state.x77 by state.region. But that connection is a mystery to me. Can you help me fill in my obvious misunderstandings?
The aggregate() function computes summary statistics of the data by group, such as the mean, minimum, or sum.
To compute group means with aggregate(), pass the numerical variable as the first argument, the categorical variable (wrapped in a list) as the second, and the function to apply (here mean) as the third. An alternative is to specify a formula of the form numerical ~ categorical.
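As a small illustration of the two calling styles just described, here is a sketch using the built-in mtcars data set (my choice of example data, not something from the original post). Both calls should give the mean mpg per cylinder count; only the name of the result column differs.

```r
# List interface: values first, grouping variable(s) as a named list, then the function
by_list <- aggregate(mtcars$mpg, by = list(cyl = mtcars$cyl), FUN = mean)

# Formula interface: numerical ~ categorical, with the data frame passed via `data`
by_formula <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

print(by_list)     # columns: cyl, x
print(by_formula)  # columns: cyl, mpg
```

The formula form tends to read better when the variables already live in one data frame; the list form is what the help-page example uses, where the grouping factor is a separate object.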
More generally, an aggregate function is called algebraic if it can be computed from a bounded number of distributive aggregate functions. For example, avg() (average) can be computed as sum()/count(), where both sum() and count() are distributive. Similarly, min_N() and max_N() (which find the N smallest and N largest values in a given set) and standard_deviation() are algebraic aggregate functions. These are generic names rather than R functions.
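To make the sum()/count() idea concrete in R terms, here is a rough sketch that rebuilds the per-region mean income from group sums and group counts. In R the corresponding functions are sum(), length() and mean(); the variable names below are mine.

```r
income <- state.x77[, "Income"]

# Distributive pieces: per-region sum and per-region count
region_sum   <- tapply(income, state.region, sum)
region_count <- tapply(income, state.region, length)

# Algebraic aggregate: the mean is derived from the two pieces
region_mean <- region_sum / region_count

# Compare against computing the mean directly
all.equal(region_mean, tapply(income, state.region, mean))
```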
From an old tampon (was it tampons?) commercial: "Proof, not only promises!"
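# Copy the matrix into a data frame and attach the region factor as a column.
# The names are arbitrary; they play no role in how rows get grouped.
state.x777 <- as.data.frame(state.x77)
state.x777 <- cbind(state.x777, stejt.ridzn = state.region)
# Grouping by the new column gives the same result as grouping by state.region
aggregate(state.x77, list(Region = state.x777$stejt.ridzn), mean)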
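Put differently, aggregate() only needs a grouping vector with one entry per row of the data: row i of state.x77 goes into the group given by element i of the factor. Here is a sketch that does one group by hand and compares it with the aggregate() result, using only the built-in state data sets (the variable names are mine):

```r
# One grouping value per row of the matrix
length(state.region) == nrow(state.x77)

# Pick the rows whose position matches "South" in the factor and average them
south_by_hand <- colMeans(state.x77[state.region == "South", ])

# The same row taken from the aggregate() result, minus the Region column
agg <- aggregate(state.x77, list(Region = state.region), mean)
south_from_agg <- unlist(agg[agg$Region == "South", -1])

all.equal(south_by_hand, south_from_agg)
```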
They are likely in the correct order, as these objects are documented on the same help page, ?state.x77, which has:
Details:
R currently contains the following “state” data sets. Note that
all data are arranged according to alphabetical order of the state
names.
Try help(state.region) etc.; they are all aligned:
Details:
R currently contains the following “state” data sets. Note that all data are arranged according to alphabetical order of the state names.
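One way to see that alignment for yourself is to put the separate objects side by side. A small sketch; the data frame name `states` is mine:

```r
# The row names of the matrix should be the state names, in the same order
identical(rownames(state.x77), state.name)

# Line the separate "state" objects up by position
states <- data.frame(state = state.name, region = state.region)
head(states)
```

If the first check comes back TRUE, element i of state.region describes row i of state.x77, which is all aggregate() relies on.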