 

R selecting duplicate rows

Okay, I'm fairly new to R and I've tried to search the documentation for what I need to do but here is the problem.

I have a data.frame called heeds.data with columns of the following form (some omitted for simplicity): eval.num, eval.count, ... fitness, fitness.mean, green.h.0, green.v.0, offset.0, green.h.1, green.v.1, ... green.h.7, green.v.7, offset.7 ...

And I have selected a row meeting the following criteria:

best.fitness <- min(heeds.data$fitness.mean[heeds.data$eval.count >= 10])
best.row <- heeds.data[heeds.data$fitness.mean == best.fitness, ]  # trailing comma: select rows, not columns

Now, what I want are all of the other rows whose columns green.h.0 to offset.7 (a contiguous section of columns) are equal to those of best.row.

I was thinking this might work

heeds.best <- heeds.data$fitness[
  heeds.data$green.h.0 == best.row$green.h.0 & ...
]

But with 24 columns it seems like a stupid method. Looking for something a bit simpler with less manual typing.

Here is a short data sample to show what I want

eval.num, eval.count, fitness, fitness.mean, green.h.0, green.v.0, offset.0
1         1           1500     1500          100        120        40
2         2           1000     1250          100        120        40
3         3           1250     1250          100        120        40
4         4           1000     1187.5        100        120        40
5         1           2000     2000          200        100        40
6         1           3000     3000          150        90         10
7         1           2000     2000          90         90         100
8         2           1800     1900          90         90         100

This should select row 4 as the "best". Then I want to grab the results as follows:

eval.num, eval.count, fitness, fitness.mean, green.h.0, green.v.0, offset.0
1         1           1500     1500          100        120        40
2         2           1000     1250          100        120        40
3         3           1250     1250          100        120        40
4         4           1000     1187.5        100        120        40

The real data isn't actually sorted and has many more columns, but that is the concept.

Thanks!

Matt asked Jun 11 '26 19:06



1 Answer

Your question is essentially just a complicated indexing question. I have a solution here, though there may be simpler ones. I loaded your example data into DF.

First, this gets us the best row index (easy using which.min()) :

R> bind <- which.min(DF[,"fitness.mean"])  # index of best row
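The question also restricts the search to rows with eval.count >= 10, which which.min() over the whole column does not take into account. A minimal sketch of one way to honor that restriction, assuming the sample data from the question. The threshold min.count = 3 is a hypothetical stand-in (the sample has no row with eval.count of 10 or more), and the name eligible is illustrative:

```r
# Sample data from the question
DF <- data.frame(
  eval.num     = 1:8,
  eval.count   = c(1, 2, 3, 4, 1, 1, 1, 2),
  fitness      = c(1500, 1000, 1250, 1000, 2000, 3000, 2000, 1800),
  fitness.mean = c(1500, 1250, 1250, 1187.5, 2000, 3000, 2000, 1900),
  green.h.0    = c(100, 100, 100, 100, 200, 150, 90, 90),
  green.v.0    = c(120, 120, 120, 120, 100, 90, 90, 90),
  offset.0     = c(40, 40, 40, 40, 40, 10, 100, 100)
)

min.count <- 3  # hypothetical threshold; the question uses 10 on the full data

# Row indices that meet the eval.count criterion
eligible <- which(DF$eval.count >= min.count)

# which.min() returns a position within 'eligible', so map it back to a row of DF
bind <- eligible[which.min(DF$fitness.mean[eligible])]
bind  # 4
```

Indexing through eligible keeps bind a row number of the full data frame, so the later steps can use it unchanged.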

Next, we apply() a row-wise comparison over the subset of columns we care about (here indexed simply by position, 5 to 7).

We use a comparison function cmpfun to compare the current row r to the best row (indexed by bind) and use all() to find the rows where every element matches. [ We need drop=FALSE here to make both sides comparable; otherwise as.numeric() helps. ]

R> cmpfun <- function(r) all(r == DF[bind,5:7,drop=FALSE])  # compare to row bind

We then simply apply this row-wise:

R> brows <- apply(DF[,5:7], 1, cmpfun)

And these are the rows we wanted:

R> DF[brows, ]
  eval.num eval.count fitness fitness.mean green.h.0 green.v.0 offset.0
1        1          1    1500         1500       100       120       40
2        2          2    1000         1250       100       120       40
3        3          3    1250         1250       100       120       40
4        4          4    1000         1188       100       120       40
R> 

It does not matter that we used three columns for comparison -- all that matters is that we have an indexing expression (here 5:7) for the columns we want.
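Since the real data has 24 comparison columns, the indexing expression can also be built from the column names rather than typed as positions. The following is a minimal sketch of that idea, not part of the solution above: it assumes the sample data from the question, and it swaps apply() for a column-wise comparison with Map() and Reduce(). The names first.col, last.col, cols, best and same are illustrative; on the full data the last column would be "offset.7".

```r
# Sample data from the question
DF <- data.frame(
  eval.num     = 1:8,
  eval.count   = c(1, 2, 3, 4, 1, 1, 1, 2),
  fitness      = c(1500, 1000, 1250, 1000, 2000, 3000, 2000, 1800),
  fitness.mean = c(1500, 1250, 1250, 1187.5, 2000, 3000, 2000, 1900),
  green.h.0    = c(100, 100, 100, 100, 200, 150, 90, 90),
  green.v.0    = c(120, 120, 120, 120, 100, 90, 90, 90),
  offset.0     = c(40, 40, 40, 40, 40, 10, 100, 100)
)

bind <- which.min(DF$fitness.mean)       # index of best row

# Contiguous block of columns, located by name instead of by position
first.col <- match("green.h.0", names(DF))
last.col  <- match("offset.0", names(DF))  # "offset.7" on the full data
cols <- first.col:last.col

# Values of the best row as a plain named vector
best <- unlist(DF[bind, cols])

# Compare each column against its value in the best row,
# then AND the per-column results together
same <- Reduce(`&`, Map(`==`, DF[cols], best))

which(same)  # 1 2 3 4
DF[same, ]
```

This compares with exact equality, which is fine when the columns hold input parameters copied verbatim; for computed floating-point values or columns containing NA, the comparison would need a tolerance or explicit NA handling.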

Dirk Eddelbuettel answered Jun 14 '26 10:06
