I have a data frame containing 10k rows. A given column X contains duplicated values. How can I randomly select ONLY ONE ROW for each value in this column?
Your question is not entirely clear, but I'm assuming you want to subsample the entire data frame, keeping one (randomly chosen) row per "duplicate class". Something like
library(plyr)
subsampled_data <- ddply(mydata, .(X),
                         function(x) {
                           x[sample(nrow(x), size = 1), ]
                         })
Should work (not tested!)
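If plyr is not available, the same "one random row per value" idea can be sketched in base R. This is only a minimal sketch: the toy data frame and the helper name pick_one are made up for illustration and are not part of the answer above.

```r
# Hypothetical toy data: column X contains duplicated values.
toy <- data.frame(X = c("a", "a", "b", "c", "c", "c"),
                  Y = 1:6,
                  stringsAsFactors = FALSE)

# pick_one is an illustrative helper name, not an existing function.
# It groups the row indices by the value in `column`, draws one index
# per group at random, and returns those rows in their original order.
pick_one <- function(df, column) {
  groups <- split(seq_len(nrow(df)), df[[column]])
  chosen <- vapply(groups,
                   function(idx) idx[sample.int(length(idx), 1)],
                   integer(1))
  df[sort(unname(chosen)), , drop = FALSE]
}

set.seed(1)
result <- pick_one(toy, "X")
print(result)
```

Indexing with sample.int(length(idx), 1) avoids the surprise that sample(idx, 1) would draw from 1:idx when a group has a single row. Note that split() drops rows whose value in the column is NA, so such rows would not appear in the result.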
My first instinct would have been something like Ben's elegant ddply solution. However, given that you have such a large data set, there are definitely faster ways. Here is one that will be many times faster if you have many unique values:
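RemoveDups <- function(df, column) {
  # Shuffle the rows so that the "first" occurrence of each value is random
  inds <- sample.int(nrow(df))
  df <- df[inds, , drop = FALSE]
  # duplicated() flags every occurrence after the first one
  dups <- duplicated(df[, column])
  df <- df[!dups, , drop = FALSE]
  inds <- inds[!dups]
  # Put the surviving rows back into their original order
  df[order(inds), , drop = FALSE]
}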
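The trick here is that duplicated() keeps the first occurrence of each value, so shuffling the rows first turns that "first" occurrence into a random one. A condensed sketch of the same idea on toy data may make the steps easier to follow. The names toy2, perm, keep and rows are illustrative only.

```r
# Hypothetical toy data with duplicated values in X.
toy2 <- data.frame(X = c(1, 1, 2, 3, 3, 3), Y = letters[1:6])

set.seed(42)
perm <- sample.int(nrow(toy2))      # random permutation of the row indices
keep <- !duplicated(toy2$X[perm])   # first occurrence in the shuffled order
rows <- sort(perm[keep])            # surviving rows, back in original order
subsampled <- toy2[rows, ]
print(subsampled)
```

Each distinct value of X ends up with exactly one row, and which row survives depends only on the permutation, so changing the seed changes the selection.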
Simulate some data (here with many unique values):
n.row = 10^6
n.col = 3
set.seed(12345)
data = data.frame(matrix(sample(1000, n.row*n.col, replace=TRUE), nrow=n.row))
Compare the 2 methods:
> system.time(ddply(data, 'X1', function(x) x[sample(nrow(x), size=1), ]))
   user  system elapsed
  3.264   0.921   4.315
> system.time(RemoveDups(data, 'X1'))
   user  system elapsed
  0.375   0.025   0.399