Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to reduce a data frame keeping the order for other columns

I am trying to reduce a data frame using the max function on a given column. I would like to preserve other columns but keeping the values from the same rows where each maximum value was selected. An example will make this explanation easier.

Let us assume we have the following data frame:

dframe <- data.frame(list(BENCH=sort(rep(letters[1:4], 4)),
                          CFG=rep(1:4, 4),
                          VALUE=runif(4 * 4)
                         ))

This gives me:

   BENCH CFG      VALUE
1      a   1 0.98828096
2      a   2 0.19630597
3      a   3 0.83539540
4      a   4 0.90988296
5      b   1 0.01191147
6      b   2 0.35164194
7      b   3 0.55094787
8      b   4 0.20744004
9      c   1 0.49864470
10     c   2 0.77845408
11     c   3 0.25278871
12     c   4 0.23440847
13     d   1 0.29795494
14     d   2 0.91766057
15     d   3 0.68044728
16     d   4 0.18448748

Now, I want to reduce the data in order to select the maximum VALUE for each different BENCH:

aggregate(VALUE ~ BENCH, dframe, FUN=max)

This gives me the expected result:

  BENCH     VALUE
1     a 0.9882810
2     b 0.5509479
3     c 0.7784541
4     d 0.9176606

Next, I tried to preserve other columns:

aggregate(cbind(VALUE, CFG) ~ BENCH, dframe, FUN=max)

This reduction returns:

  BENCH     VALUE CFG
1     a 0.9882810   4
2     b 0.5509479   4
3     c 0.7784541   4
4     d 0.9176606   4

Both VALUE and CFG are reduced using max function. But this is not what I want. For instance, in this example I would like to obtain:

  BENCH     VALUE CFG
1     a 0.9882810   1
2     b 0.5509479   3
3     c 0.7784541   2
4     d 0.9176606   2

where CFG is not reduced, but it just keeps the value associated to the maximum VALUE for each different BENCH.

How could I change my reduction in order to obtain the last result shown?

like image 325
betabandido Avatar asked Feb 20 '23 04:02

betabandido


2 Answers

If your problem scales to large data (millions or 10s of millions of rows and groups), then package data.table may be of interest. Here's the relevant syntax:

require(data.table)
dtable <- data.table(dframe)
dtable[, .SD[which.max(VALUE),], by = BENCH]
like image 88
Chase Avatar answered Feb 22 '23 18:02

Chase


Here's a base R solution:

do.call(rbind, by(dframe, dframe$BENCH, FUN=function(X) X[which.max(X$VALUE),]))
#   BENCH CFG     VALUE
# a     a   1 0.9882810
# b     b   3 0.5509479
# c     c   2 0.7784541
# d     d   2 0.9176606
like image 28
Josh O'Brien Avatar answered Feb 22 '23 18:02

Josh O'Brien