R selecting duplicate rows

Question

Okay, I'm fairly new to R. I've searched the documentation for what I need to do without luck, so here is the problem.

I have a data.frame called heeds.data with columns of the following form (some omitted for simplicity): eval.num, eval.count, ... fitness, fitness.mean, green.h.0, green.v.0, offset.0, green.h.1, green.v.1, ... green.h.7, green.v.7, offset.7 ...

And I have selected a row meeting the following criteria:

best.fitness <- min(heeds.data$fitness.mean[heeds.data$eval.count >= 10])
best.row <- heeds.data[heeds.data$fitness.mean == best.fitness, ]

Now, what I want are all of the other rows whose columns green.h.0 to offset.7 (a contiguous section of columns) are equal to those of best.row.

I was thinking this might work

heeds.best <- heeds.data$fitness[
  heeds.data$green.h.0 == best.row$green.h.0 & ...
]

But with 24 columns it seems like a stupid method. Looking for something a bit simpler with less manual typing.

Here is a short data sample to show what I want

eval.num, eval.count, fitness, fitness.mean, green.h.0, green.v.0, offset.0
1         1           1500     1500          100        120        40
2         2           1000     1250          100        120        40
3         3           1250     1250          100        120        40
4         4           1000     1187.5        100        120        40
5         1           2000     2000          200        100        40
6         1           3000     3000          150        90         10
7         1           2000     2000          90         90         100
8         2           1800     1900          90         90         100

This should select row 4 as the "best". Then I want to grab the results as follows:

eval.num, eval.count, fitness, fitness.mean, green.h.0, green.v.0, offset.0
1         1           1500     1500          100        120        40
2         2           1000     1250          100        120        40
3         3           1250     1250          100        120        40
4         4           1000     1187.5        100        120        40

Data isn't actually sorted and there are many more columns but that is the concept

Thanks!

Dirk Eddelbuettel · Accepted Answer

Your question is essentially just a complicated indexing question. I have a solution here, though there may be simpler ones. I loaded your example data into DF.
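One way to get the sample into DF, as a sketch: the values are copied from the question, and reading them with read.table(text = ...) is an assumption on my part, since the answer does not say how the data was loaded.

```r
# Sample data copied from the question; the name DF matches the answer.
# Loading via read.table(text = ...) is an assumption, not the answer's method.
DF <- read.table(header = TRUE, text = "
eval.num eval.count fitness fitness.mean green.h.0 green.v.0 offset.0
1 1 1500 1500   100 120 40
2 2 1000 1250   100 120 40
3 3 1250 1250   100 120 40
4 4 1000 1187.5 100 120 40
5 1 2000 2000   200 100 40
6 1 3000 3000   150 90  10
7 1 2000 2000   90  90  100
8 2 1800 1900   90  90  100
")

# Inspect the column names and types before indexing by position.
str(DF)
```

With this layout the comparison columns sit at positions 5 to 7, which is what the steps below rely on.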

First, this gets us the index of the best row (easy using which.min()):

R> bind <- which.min(DF[,"fitness.mean"])  # index of best row
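The question also restricts the candidates to rows with eval.count >= 10, which the line above does not do. A minimal sketch of adding that filter follows. The threshold variable min.count is a hypothetical name, and it is set to 2 here only because no row in the small sample reaches 10.

```r
# Sample data copied from the question.
DF <- read.table(header = TRUE, text = "
eval.num eval.count fitness fitness.mean green.h.0 green.v.0 offset.0
1 1 1500 1500   100 120 40
2 2 1000 1250   100 120 40
3 3 1250 1250   100 120 40
4 4 1000 1187.5 100 120 40
5 1 2000 2000   200 100 40
6 1 3000 3000   150 90  10
7 1 2000 2000   90  90  100
8 2 1800 1900   90  90  100
")

# Assumption: 2 is used for this sample; the question uses 10 on the full data.
min.count <- 2

# Row numbers of the candidates that were evaluated often enough.
eligible <- which(DF$eval.count >= min.count)

# which.min() indexes into the subset, so map it back to a row of DF.
bind <- eligible[which.min(DF$fitness.mean[eligible])]
bind
```

If no row meets the threshold, eligible is empty and bind ends up as integer(0), so it is worth checking length(bind) before going on.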

Next, we apply() a row-wise comparison over the subset of columns we care about (here indexed simply by position, 5 to 7).

We use a comparison function cmpfun to compare the current row r to the best row (indexed by bind) and use all() to get rows where all elements correspond. [ We need drop=FALSE here to make it comparable on both sides, else as.numeric() helps. ]

R> cmpfun <- function(r) all(r == DF[bind,5:7,drop=FALSE])  # compare to row bind

We then simply apply this row-wise:

R> brows <- apply(DF[,5:7], 1, cmpfun)

And these are the rows we wanted:

R> DF[brows, ]
  eval.num eval.count fitness fitness.mean green.h.0 green.v.0 offset.0
1        1          1    1500         1500       100       120       40
2        2          2    1000         1250       100       120       40
3        3          3    1250         1250       100       120       40
4        4          4    1000         1188       100       120       40
R>

It does not matter that we used three columns for the comparison -- all that matters is that we have an indexing expression (here 5:7) for the columns we want.
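For the full data, where the block runs from green.h.0 to offset.7, the indexing expression could be built from the column names instead of hard-coded positions. The sketch below does that on the sample (ending at offset.0, since the sample stops there) and also shows a column-wise comparison with Map() and Reduce() as a possible alternative to the row-wise apply(). Both parts are my own illustration rather than part of the answer above, and the names first, last, cols, key and same are hypothetical.

```r
# Sample data copied from the question.
DF <- read.table(header = TRUE, text = "
eval.num eval.count fitness fitness.mean green.h.0 green.v.0 offset.0
1 1 1500 1500   100 120 40
2 2 1000 1250   100 120 40
3 3 1250 1250   100 120 40
4 4 1000 1187.5 100 120 40
5 1 2000 2000   200 100 40
6 1 3000 3000   150 90  10
7 1 2000 2000   90  90  100
8 2 1800 1900   90  90  100
")

bind <- which.min(DF$fitness.mean)      # index of best row, as in the answer

# Build the contiguous column range from names rather than positions.
first <- match("green.h.0", names(DF))
last  <- match("offset.0", names(DF))   # would be "offset.7" on the full data
cols  <- first:last

# One-row data frame holding the values every other row must match.
key <- DF[bind, cols]

# Compare each column to its key value, then AND the results together.
# Note: an NA in any compared column makes that row's result NA.
same <- Reduce(`&`, Map(`==`, DF[cols], key))

DF[same, ]
```

On the sample this picks rows 1 to 4, the same rows as the apply() version. Comparing whole columns avoids the matrix conversion that apply() performs on each call, which may matter with many rows, though I have not timed it.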
