I know there must be an easy answer to this but somehow I can't seem to find it...
I have a data frame with two numeric columns. I would like to remove the rows for which there exists at least one other row in the data frame whose values are bigger in both columns.
So if I have
  Col1 Col2
1    2    3
2    4    7
3    5    6
I would like to remove the first row, because the second row is bigger in both columns (4 > 2 and 7 > 3), and keep only rows 2 and 3.
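For reference, here is the same example as a reproducible data frame (the name DF is arbitrary; the answers below use various names for it):

DF <- data.frame(Col1 = c(2, 4, 5), Col2 = c(3, 7, 6))
DF
#   Col1 Col2
# 1    2    3
# 2    4    7
# 3    5    6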
Thanks a lot!
That problem is called a "skyline query" by database administrators (they may have other algorithms) and an "efficient frontier" by economists. Plotting the data can make it clear what we are looking for.
n <- 40
d <- data.frame(
  x = rnorm(n),
  y = rnorm(n)
)
# We want the "extreme" points in the following plot
par(mar=c(1,1,1,1))
plot(d, axes=FALSE, xlab="", ylab="")
for(i in 1:n) {
  # shade the rectangle of points dominated by point i
  polygon(c(-10, d$x[i], d$x[i], -10), c(-10, -10, d$y[i], d$y[i]),
          col = rgb(.9, .9, .9, .2))
}
The algorithm is as follows: sort the points by the first coordinate in decreasing order, then walk through them and keep an observation only if its second coordinate is larger than that of the last retained one.
# Sort by x (decreasing) and keep a point only if its y exceeds
# the largest y retained so far.
d <- d[order(d$x, decreasing=TRUE), ]
result <- d[1, ]
for (i in seq_len(nrow(d))[-1]) {
  if (d$y[i] > result$y[nrow(result)]) {
    result <- rbind(result, d[i, ])  # inefficient: grows the result row by row
  }
}
points(result, cex=3, pch=15)
Edit (2015-03-02): For a more efficient solution, please see Patrick Roocks' rPref, a package for "Database Preferences and Skyline Computation" (also linked in his answer below). To show that it finds the same solution as my code here, I have appended an example using it to my original answer.
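Something along these lines (just a sketch, assuming the rPref package is installed; with continuous random data there are no ties, so both approaches should select the same rows):

library(rPref)
# Skyline of the same data, as computed by rPref
sky <- psel(d, high(x) | high(y))
points(sky, cex = 1.5, pch = 1)  # the circles should land on the filled squares
# Same set of rows as the loop above?
setequal(rownames(result), rownames(sky))  # expected: TRUE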
Riffing off of Vincent Zoonekynd's enlightening response, here's an algorithm that's fully vectorized, and likely more efficient:
set.seed(100)
d <- data.frame(x = rnorm(100), y = rnorm(100))
D <- d[order(d$x, d$y, decreasing=TRUE), ]
# Keep the rows where the running maximum of y increases:
# those are exactly the non-dominated points.
res <- D[which(!duplicated(cummax(D$y))), ]
#             x         y
# 64  2.5819589 0.7946803
# 20  2.3102968 1.6151907
# 95 -0.5302965 1.8952759
# 80 -2.0744048 2.1686003
# And then, if you would prefer the rows to be in
# their original order, just do:
d[sort(as.numeric(rownames(res))), ]
#             x         y
# 20  2.3102968 1.6151907
# 64  2.5819589 0.7946803
# 80 -2.0744048 2.1686003
# 95 -0.5302965 1.8952759
Or, using the rPref package:
library(rPref)
psel(d, high(x) | high(y))
#             x         y
# 20  2.3102968 1.6151907
# 64  2.5819589 0.7946803
# 80 -2.0744048 2.1686003
# 95 -0.5302965 1.8952759
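To put a rough number on the "likely more efficient" claim above, here is a benchmark sketch. It assumes the microbenchmark package is installed, and the wrapper names pareto_loop() and pareto_vec() are made up for this comparison:

library(microbenchmark)

# The loop from Vincent Zoonekynd's answer, wrapped in a function
pareto_loop <- function(d) {
  d <- d[order(d$x, decreasing = TRUE), ]
  result <- d[1, ]
  for (i in seq_len(nrow(d))[-1]) {
    if (d$y[i] > result$y[nrow(result)]) {
      result <- rbind(result, d[i, ])  # grows the result row by row
    }
  }
  result
}

# The vectorized version from this answer
pareto_vec <- function(d) {
  D <- d[order(d$x, d$y, decreasing = TRUE), ]
  D[which(!duplicated(cummax(D$y))), ]
}

set.seed(100)
d <- data.frame(x = rnorm(1e4), y = rnorm(1e4))
microbenchmark(pareto_loop(d), pareto_vec(d), times = 10)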
Here is an sqldf solution, where DF is the data frame holding the data:
library(sqldf)
sqldf("select * from DF a
       where not exists (
         select * from DF b
         where b.Col1 >= a.Col1 and b.Col2 > a.Col2
            or b.Col1 > a.Col1 and b.Col2 >= a.Col2
       )")
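For example, with the data frame from the question (this assumes sqldf and its default SQLite backend are installed), row 1 is dropped and rows 2 and 3 are kept:

library(sqldf)

DF <- data.frame(Col1 = c(2, 4, 5), Col2 = c(3, 7, 6))
sqldf("select * from DF a
       where not exists (
         select * from DF b
         where b.Col1 >= a.Col1 and b.Col2 > a.Col2
            or b.Col1 > a.Col1 and b.Col2 >= a.Col2
       )")
# (2, 3) is dominated by (4, 7), so only (4, 7) and (5, 6) remain.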
This question is pretty old, but meanwhile there is a new solution. I hope it is OK to do a bit of self-promotion here: I developed the rPref package, which does efficient skyline computation using C++ algorithms. With rPref installed, the query from the question can be done as follows (assuming that df is the name of the data set):
library(rPref)
psel(df, high(Col1) | high(Col2))
This removes only those tuples where some other tuple is better in both dimensions. If you require the other tuple to be strictly better in at least one dimension (and better or equal in the other), use high(Col1) * high(Col2) instead.
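For example, with the data frame from the question (assuming rPref is installed), only rows 2 and 3 survive:

library(rPref)

df <- data.frame(Col1 = c(2, 4, 5), Col2 = c(3, 7, 6))
psel(df, high(Col1) | high(Col2))
# Row 1 (2, 3) is removed because row 2 (4, 7) is bigger in both columns;
# rows (4, 7) and (5, 6) remain.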