I know there must be an easy answer to this but somehow I can't seem to find it...
I have a data frame with two numeric columns. I would like to remove the rows for which there exists at least one other row in the data frame whose values are bigger in both columns.
So if I have
  Col1 Col2
1    2    3
2    4    7
3    5    6
I would like to remove the first row, because the second row is bigger in both columns (4 > 2 and 7 > 3), and keep only rows 2 and 3.
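For reference, here is the same example as a reproducible data frame (the name DF is arbitrary; the answers below use various names for it):

DF <- data.frame(Col1 = c(2, 4, 5), Col2 = c(3, 7, 6))
DF
#   Col1 Col2
# 1    2    3
# 2    4    7
# 3    5    6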
Thanks a lot!
That problem is called a "skyline query" by database administrators (they may have other algorithms) and an "efficient frontier" by economists. Plotting the data can make it clear what we are looking for.
n <- 40
d <- data.frame(
  x = rnorm(n),
  y = rnorm(n)
)
# We want the "extreme" points in the following plot
par(mar=c(1,1,1,1))
plot(d, axes=FALSE, xlab="", ylab="")
for(i in 1:n) {
  # shade the rectangle of points dominated by point i
  polygon(c(-10, d$x[i], d$x[i], -10), c(-10, -10, d$y[i], d$y[i]),
          col = rgb(.9, .9, .9, .2))
}
The algorithm is as follows: sort the points by the first coordinate in decreasing order, then walk through them and keep an observation only if its second coordinate is larger than that of the last retained one.
# Sort by x (decreasing) and keep a point only if its y exceeds
# the largest y retained so far.
d <- d[order(d$x, decreasing=TRUE), ]
result <- d[1, ]
for (i in seq_len(nrow(d))[-1]) {
  if (d$y[i] > result$y[nrow(result)]) {
    result <- rbind(result, d[i, ])  # inefficient: grows the result row by row
  }
}
points(result, cex=3, pch=15)
Edit (2015-03-02): For a more efficient solution, please see Patrick Roocks' rPref, a package for "Database Preferences and Skyline Computation" (also linked in his answer below). To show that it finds the same solution as my code here, I have appended an example using it to my original answer.
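Something along these lines (just a sketch, assuming the rPref package is installed; with continuous random data there are no ties, so both approaches should select the same rows):

library(rPref)
# Skyline of the same data, as computed by rPref
sky <- psel(d, high(x) | high(y))
points(sky, cex = 1.5, pch = 1)  # the circles should land on the filled squares
# Same set of rows as the loop above?
setequal(rownames(result), rownames(sky))  # expected: TRUE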
Riffing off of Vincent Zoonekynd's enlightening response, here's an algorithm that's fully vectorized, and likely more efficient:
set.seed(100)
d <- data.frame(x = rnorm(100), y = rnorm(100))
D <- d[order(d$x, d$y, decreasing=TRUE), ]
# Keep the rows where the running maximum of y increases:
# those are exactly the non-dominated points.
res <- D[which(!duplicated(cummax(D$y))), ]
#             x         y
# 64  2.5819589 0.7946803
# 20  2.3102968 1.6151907
# 95 -0.5302965 1.8952759
# 80 -2.0744048 2.1686003
# And then, if you would prefer the rows to be in
# their original order, just do:
d[sort(as.numeric(rownames(res))), ]
#             x         y
# 20  2.3102968 1.6151907
# 64  2.5819589 0.7946803
# 80 -2.0744048 2.1686003
# 95 -0.5302965 1.8952759
Or, using the rPref package:
library(rPref)
psel(d, high(x) | high(y))
#             x         y
# 20  2.3102968 1.6151907
# 64  2.5819589 0.7946803
# 80 -2.0744048 2.1686003
# 95 -0.5302965 1.8952759
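To put a rough number on the "likely more efficient" claim above, here is a benchmark sketch. It assumes the microbenchmark package is installed, and the wrapper names pareto_loop() and pareto_vec() are made up for this comparison:

library(microbenchmark)

# The loop from Vincent Zoonekynd's answer, wrapped in a function
pareto_loop <- function(d) {
  d <- d[order(d$x, decreasing = TRUE), ]
  result <- d[1, ]
  for (i in seq_len(nrow(d))[-1]) {
    if (d$y[i] > result$y[nrow(result)]) {
      result <- rbind(result, d[i, ])  # grows the result row by row
    }
  }
  result
}

# The vectorized version from this answer
pareto_vec <- function(d) {
  D <- d[order(d$x, d$y, decreasing = TRUE), ]
  D[which(!duplicated(cummax(D$y))), ]
}

set.seed(100)
d <- data.frame(x = rnorm(1e4), y = rnorm(1e4))
microbenchmark(pareto_loop(d), pareto_vec(d), times = 10)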
Here is an sqldf solution, where DF is the data frame holding the data:
library(sqldf)
sqldf("select * from DF a
       where not exists (
         select * from DF b
         where b.Col1 >= a.Col1 and b.Col2 > a.Col2
            or b.Col1 > a.Col1 and b.Col2 >= a.Col2
       )")
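For example, with the data frame from the question (this assumes sqldf and its default SQLite backend are installed), row 1 is dropped and rows 2 and 3 are kept:

library(sqldf)

DF <- data.frame(Col1 = c(2, 4, 5), Col2 = c(3, 7, 6))
sqldf("select * from DF a
       where not exists (
         select * from DF b
         where b.Col1 >= a.Col1 and b.Col2 > a.Col2
            or b.Col1 > a.Col1 and b.Col2 >= a.Col2
       )")
# (2, 3) is dominated by (4, 7), so only (4, 7) and (5, 6) remain.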
This question is pretty old, but meanwhile there is a new solution. I hope it is OK to do a bit of self-promotion here: I developed the rPref package, which does efficient skyline computation using C++ algorithms. With rPref installed, the query from the question can be done as follows (assuming that df is the name of the data set):
library(rPref)
psel(df, high(Col1) | high(Col2))
This removes only those tuples where some other tuple is better in both dimensions. If you require the other tuple to be strictly better in at least one dimension (and better or equal in the other), use high(Col1) * high(Col2) instead.
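For example, with the data frame from the question (assuming rPref is installed), only rows 2 and 3 survive:

library(rPref)

df <- data.frame(Col1 = c(2, 4, 5), Col2 = c(3, 7, 6))
psel(df, high(Col1) | high(Col2))
# Row 1 (2, 3) is removed because row 2 (4, 7) is bigger in both columns;
# rows (4, 7) and (5, 6) remain.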