
Extract rows with highest and lowest values from a data frame

I'm quite new to R and use it mainly for visualising statistics with the ggplot2 library. Now I have run into a problem with data preparation.

I need to write a function that removes some number (2, 5 or 10) of rows with the highest and lowest values in a specified column from a data frame and puts them into another data frame, and does this for each combination of two factors (in my case, for each day and server).

So far I have done the following (a MWE using the built-in esoph example dataset).

I have sorted the frame according to the desired parameter (ncontrols in example):

esoph <- esoph[with(esoph, order(-ncontrols)), ]

I can display first/last records for each factor value (in this example for each age range):

by(data = esoph, INDICES = esoph$agegp, FUN = head, 3)
by(data = esoph, INDICES = esoph$agegp, FUN = tail, 3)

So basically, I can see the highest and lowest values, but I don't know how to extract them into another data frame and how to remove them from the main one.

Also, in the above example I can see the top/bottom records for each value of one factor (age range), but in reality I need the highest and lowest records for each combination of two factors; in this example they could be agegp and alcgp.

I am not even sure if these above steps are OK - perhaps using plyr would work better? I'd appreciate any hints.

asked Oct 05 '22 23:10 by Paweł Rumian


1 Answer

Yes, you can use plyr as follows:

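library(plyr)  # provides ddply

# Random example data with two grouping factors and one numeric column
esoph <- data.frame(agegp = sample(letters[1:2], 20, replace = TRUE),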
                    alcgp = sample(LETTERS[1:2], 20, replace = TRUE),
                    ncontrols = runif(20))

# Extract the rows with the lowest and highest ncontrols in each agegp/alcgp group
ddply(esoph, c("agegp", "alcgp"),
      function(x){idx <- c(which.min(x$ncontrols),
                           which.max(x$ncontrols))
                  x[idx, , drop = FALSE]})
#   agegp alcgp  ncontrols
# 1     a     A 0.03091483
# 2     a     A 0.88529790
# 3     a     B 0.51265447
# 4     a     B 0.86111649
# 5     b     A 0.28372232
# 6     b     A 0.61698401
# 7     b     B 0.05618841
# 8     b     B 0.89346943

# Keep the remaining rows, i.e. everything except the per-group minimum and maximum
ddply(esoph, c("agegp", "alcgp"),
      function(x){idx <- c(which.min(x$ncontrols),
                           which.max(x$ncontrols))
                  x[-idx, , drop = FALSE]})
#    agegp alcgp ncontrols
# 1      a     A 0.3745029
# 2      a     B 0.7621474
# 3      a     B 0.6319013
# 4      b     A 0.3055078
# 5      b     A 0.5146028
# 6      b     B 0.3735615
# 7      b     B 0.2528612
# 8      b     B 0.4415205
# 9      b     B 0.6868219
# 10     b     B 0.3750102
# 11     b     B 0.2279462
# 12     b     B 0.1891052

There are possibly many alternatives, e.g. using head and tail if your data is already sorted, but this should work.
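
The head and tail alternative might look roughly like the sketch below. It uses base R only and takes the number of rows n as an argument, so it can cover the 2, 5 or 10 rows mentioned in the question. The function name split_extremes, its arguments and the temporary .row column are assumptions made for illustration, not part of plyr or of the code above.

```r
# Hypothetical helper: split a data frame into the n lowest and n highest
# rows per group ("extremes") and everything that is left ("rest").
split_extremes <- function(df, value, groups, n = 1) {
  df$.row <- seq_len(nrow(df))                   # temporary row id
  pieces <- split(df, df[groups], drop = TRUE)   # one piece per non-empty group combination
  picked <- lapply(pieces, function(x) {
    x <- x[order(x[[value]]), , drop = FALSE]    # sort ascending within the group
    unique(rbind(head(x, n), tail(x, n)))        # small groups can overlap, so de-duplicate
  })
  extremes <- do.call(rbind, unname(picked))
  rest <- df[!df$.row %in% extremes$.row, , drop = FALSE]
  extremes$.row <- NULL                          # drop the temporary id again
  rest$.row <- NULL
  rownames(extremes) <- NULL
  list(extremes = extremes, rest = rest)
}

# Usage on the built-in esoph dataset, grouping by two factors.
res <- split_extremes(datasets::esoph, value = "ncontrols",
                      groups = c("agegp", "alcgp"), n = 1)
```

res$extremes then holds the extracted rows and res$rest the data frame without them. Groups with fewer than 2 * n rows end up entirely in res$extremes, which may or may not be what you want.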

answered Oct 10 '22 02:10 by flodel