Randomly select on Data Frame, for unique rows

Question

I have a data frame containing 10k rows. A given column X contains duplicated values. How can I randomly select only one row for each value in this column?

Ben Bolker · Accepted Answer

Your question is not entirely clear, but I'm assuming you want to subsample the entire data frame, keeping one (randomly chosen) row per "duplicate class". Something like

library(plyr)
subsampled_data <- ddply(mydata,.(X),
    function(x) {
          x[sample(nrow(x),size=1),]
    })

Should work (not tested!)
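If you would rather not depend on plyr, the same idea (one randomly chosen row per value of X) can be sketched in base R. This is only an illustration under assumptions: the data frame `mydata` below is made up for the example, and `pick_one` and `keep` are hypothetical names, not part of any package.

```r
# Hypothetical example data: column X contains duplicated values
mydata <- data.frame(X = c("a", "a", "b", "c", "c", "c"), Y = 1:6)

set.seed(1)  # only to make the random choice reproducible

# Pick one element of idx at random. Indexing via sample.int() avoids the
# sample(n) pitfall, where a length-one vector n is treated as 1:n.
pick_one <- function(idx) idx[sample.int(length(idx), 1)]

# Split the row numbers by the value of X and choose one row number per group
keep <- vapply(split(seq_len(nrow(mydata)), mydata$X), pick_one, integer(1))

# Subset the original data, keeping the original row order
subsampled_data <- mydata[sort(keep), ]
print(subsampled_data)
```

The result should have exactly one row per distinct value of X; which row you get for "a" and "c" depends on the seed.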

John Colby · Answer

My first instinct would have been something like Ben's elegant ddply solution. However, knowing now that you have such a large data set, there are definitely faster ways. Here is one that will be many times faster if you have many unique values:

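RemoveDups <- function(df, column) {
  # Shuffle the rows so the first occurrence of each value is a random one
  inds = sample(seq_len(nrow(df)))
  df   = df[inds, ]

  # Keep only the first (now random) occurrence of each value
  dups = duplicated(df[, column])
  df   = df[!dups, ]
  inds = inds[!dups]

  # Put the kept rows back in their original order
  df[order(inds), ]
}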
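To see why shuffling first makes duplicated() keep a random row, here is a small sanity check on a toy data frame. It is a sketch, not a benchmark: the data frame `toy`, the helper `pick`, and the repetition count are all made up for the illustration, and `pick` is a stripped-down version of the shuffle-then-deduplicate step that skips restoring the original order.

```r
# Hypothetical toy data: value "a" occurs in rows 1-3, "b" in rows 4-5
toy <- data.frame(X = c("a", "a", "a", "b", "b"), id = 1:5)

# Shuffle, then drop every row whose X value was already seen.
# duplicated() flags all but the FIRST occurrence, and after shuffling
# the first occurrence is a randomly chosen one.
pick <- function(df) {
  shuffled <- df[sample(nrow(df)), ]
  shuffled[!duplicated(shuffled$X), ]
}

set.seed(42)  # for reproducibility of this illustration only

# Repeat many times and count how often each original row is kept
n.rep  <- 3000
kept   <- unlist(replicate(n.rep, pick(toy)$id, simplify = FALSE))
counts <- table(kept)
print(round(counts / n.rep, 2))
```

If the selection is uniform within each group, rows 1 to 3 should each be kept in roughly a third of the runs and rows 4 and 5 in roughly half.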

Simulate some data (here with many unique values):

n.row = 10^6
n.col = 3

set.seed(12345)
data  = data.frame(matrix(sample(1000, n.row*n.col, replace=TRUE), nrow=n.row))

Compare the 2 methods:

> system.time(ddply(data, 'X1', function(x) x[sample(nrow(x), size=1), ]))
   user  system elapsed 
  3.264   0.921   4.315 
> system.time(RemoveDups(data, 'X1'))
   user  system elapsed 
  0.375   0.025   0.399

Tags: r

Rad
