Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Random sample of rows with "at least one from each" condition

Tags:

r

dplyr

So I've got a dataset that looks something like this:

a   b   c
23  34  Falcons
14  9   Hawks
2   18  Eagles
3   21  Eagles
22  8   Falcons
11  4   Hawks

And I know I can do a random subset of rows with a nested conditional, but what I want to do is create a random subset that takes at least one of each available value in column 'c'.

So possible correct subsets would be

23  34  Falcons
14  9   Hawks
3   21  Eagles

or

11  4   Hawks
2   18  Eagles
22  8   Falcons

[in no particular order], but something like:

14  9   Hawks
2   18  Eagles
3   21  Eagles

would NOT work, because 'Falcons' is not represented. Is there an easy way to do this in R?

like image 573
skathan Avatar asked Mar 08 '16 01:03

skathan


2 Answers

Use group_by and sample_n functions in the dplyr package.

text1 <- "a   b   c
23  34  Falcons
14  9   Hawks
2   18  Eagles
3   21  Eagles
22  8   Falcons
11  4   Hawks"

dat <- read.table(text=text1, head=T, as.is=T)

library(dplyr)
dat %>% group_by(c) %>% sample_n(1)

# Source: local data frame [3 x 3]
# Groups: c [3]

#       a     b       c
#   (int) (int)   (chr)
# 1     3    21  Eagles
# 2    22     8 Falcons
# 3    11     4   Hawks

UPDATE: You can write a function to do the sampling.

sample_df <- function(df) {
  df.r <- sample(1:nrow(df), 1)
  return(sample_n(df, df.r))
}
dat %>% group_by(c) %>% do(sample_df(.))
like image 109
Ven Yao Avatar answered Nov 17 '22 00:11

Ven Yao


You can specify the n for each group here (use 1s if you only want a data frame with nrows == number of groups

dd <- read.table(header = TRUE, text = 'a   b   c
23  34  Falcons
14  9   Hawks
2   18  Eagles
3   21  Eagles
22  8   Falcons
11  4   Hawks', stringsAsFactors = FALSE)

(n <- setNames(c(1,2,1), unique(dd$c)))
# Falcons   Hawks  Eagles 
#       1       2       1 

set.seed(1)
dd[as.logical(ave(dd$c, dd$c, FUN = function(x)
  sample(rep(c(FALSE, TRUE), c(length(x) - n[x[1]], n[x[1]]))))), ]

#    a  b       c
# 1 23 34 Falcons
# 2 14  9   Hawks
# 4  3 21  Eagles
# 6 11  4   Hawks

Putting this into a function to automate some other things for you

sample_each <- function(data, var, n = 1L) {
  lvl <- table(data[, var])
  n1 <- setNames(rep_len(n, length(lvl)), names(lvl))
  n0 <- lvl - n1
  idx <- ave(as.character(data[, var]), data[, var], FUN = function(x)
    sample(rep(0:1, c(n0[x[1]], n1[x[1]]))))
  data[!!(as.numeric(idx)), ]
}

sample_each(dd, 'c', n = c(1,2,1))
#    a  b       c
# 1 23 34 Falcons
# 3  2 18  Eagles
# 5 22  8 Falcons
# 6 11  4   Hawks

sample_each(mtcars, 'gear', 1)
#                mpg cyl  disp  hp drat   wt  qsec vs am gear carb
# Valiant       18.1   6 225.0 105 2.76 3.46 20.22  1  0    3    1
# Merc 280      19.2   6 167.6 123 3.92 3.44 18.30  1  0    4    4
# Maserati Bora 15.0   8 301.0 335 3.54 3.57 14.60  0  1    5    8


sample_each(mtcars, 'gear', c(2,2,5))
#                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Hornet Sportabout  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
# Porsche 914-2      26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
# Lotus Europa       30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
# Ford Pantera L     15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
# Ferrari Dino       19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
# Maserati Bora      15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
# Mazda RX4 Wag1     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
# Hornet Sportabout1 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
# Merc 2801          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
like image 2
rawr Avatar answered Nov 17 '22 00:11

rawr