
Randomly select one row per unique value in a data frame

Tags:

r

I have a data frame containing 10k rows. A given column X contains duplicated values. How can I randomly select ONLY ONE ROW for each distinct value in this column?

Rad asked Nov 07 '11 19:11


2 Answers

Your question is not entirely clear, but I'm assuming you want to subsample the entire data frame, keeping one (randomly chosen) row per "duplicate class". Something like:

library(plyr)
# Split mydata by X, then keep one randomly chosen row from each piece
subsampled_data <- ddply(mydata, .(X), function(x) {
  x[sample(nrow(x), size = 1), ]
})

This should work (not tested!)
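If plyr is not available, the same per-group sampling can be sketched in base R. This is only an illustration under assumptions: the toy `mydata` and its `value` column are made up, and `pick_one` is a hypothetical helper, not part of any package.

```r
set.seed(1)

# Toy data (hypothetical): column X holds the duplicated values
mydata <- data.frame(X = c("a", "a", "b", "c", "c", "c"), value = 1:6)

# Pick one element of idx at random; indexing via sample.int() avoids the
# sample() surprise when idx has length 1
pick_one <- function(idx) idx[sample.int(length(idx), 1)]

# For each distinct X, choose one row number at random
keep <- as.vector(tapply(seq_len(nrow(mydata)), mydata$X, pick_one))

# Keep the chosen rows in their original order
subsampled_data <- mydata[sort(keep), ]
print(subsampled_data)
```

The result has exactly one row per distinct value of `X`; which row is kept within each group depends on the random seed.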

Ben Bolker answered Oct 04 '22 00:10


My first instinct would have been something like Ben's elegant ddply solution. However, knowing now that you have such a large data set, there are definitely faster ways. Here is one that will be many times faster if you have many unique values:

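RemoveDups <- function(df, column) {
  # Shuffle the rows so that the row kept for each value is chosen at random
  inds <- sample(seq_len(nrow(df)))
  df   <- df[inds, , drop = FALSE]

  # duplicated() flags every occurrence after the first in the shuffled order
  dups <- duplicated(df[[column]])
  df   <- df[!dups, , drop = FALSE]
  inds <- inds[!dups]

  # Put the surviving rows back into their original order
  df[order(inds), , drop = FALSE]
}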
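To see why shuffling followed by `duplicated()` keeps one random row per value, here is a minimal sketch of the same idea on a plain vector. The vector `x` and the names `perm`, `first`, and `kept` are made up for illustration.

```r
set.seed(42)

# Toy vector (hypothetical) with repeated values
x <- c("a", "a", "b", "c", "c", "c")

# Random ordering of the positions 1..length(x)
perm <- sample(seq_along(x))

# In the shuffled order, mark the first occurrence of each value;
# because the order is random, that first occurrence is a random pick
first <- !duplicated(x[perm])

# Original positions of the kept elements, restored to original order
kept <- sort(perm[first])
print(x[kept])
```

Each distinct value appears exactly once in `x[kept]`, and the position chosen for it varies with the seed.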

Simulate some data (here with many unique values):

n.row = 10^6
n.col = 3

set.seed(12345)
data  = data.frame(matrix(sample(1000, n.row*n.col, replace=TRUE), nrow=n.row))

Compare the 2 methods:

> system.time(ddply(data, 'X1', function(x) x[sample(nrow(x), size=1), ]))
   user  system elapsed 
  3.264   0.921   4.315 
> system.time(RemoveDups(data, 'X1'))
   user  system elapsed 
  0.375   0.025   0.399 
John Colby answered Oct 03 '22 23:10