Let's say you have the following data frame:
set.seed(12345)
people <- data.frame(Name = paste("Name", 1:51),
Var1 = sample(c("A", "B"), 51, prob = c(0.3, 0.7), replace = TRUE),
Var2 = sample(1:2, 51, replace = TRUE))
table(people$Var1, people$Var2)
     1  2
  A 12  5
  B 21 13
I would like to split the dataset into groups depending on certain criteria.
For example, I might want to divide the dataset into 9 groups, so that each one has at least 1 person with Var1 == 'A' and a roughly equal balance between 1 and 2 for Var2.
Obviously, an exact split is not possible so, in this example, I would allocate 5 people to each group and then allocate the rest randomly, in order to have either 5 or 6 people in each group.
Is there an efficient way of doing this?
PS: I am asking how to do this in R, as I already have these data in R, but a generic solution would be appreciated as well
A simple approach with dplyr:
- shuffle the rows, then sort them by Var1, Var2
- take the n=9 first people (with Var1 == 'A' because of order) and put one in each group
- sort the rest by Var2 and dispatch them in the groups
library(dplyr)
n <- 9
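# Shuffle the rows, then sort so that the Var1 == "A" rows come first
data <- people[sample(nrow(people), replace = FALSE), ] %>% arrange(Var1, Var2)
# Give each group one of the first n rows, then deal out the rest in Var2 order
rbind(head(data, n) %>% mutate(grp = 1:n),
      tail(data, -n) %>% arrange(Var2) %>%
        mutate(grp = rep(1:n, length.out = nrow(people) - n))
) %>% split(.$grp)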
$`1`
Name Var1 Var2 grp
1 Name 43 A 1 1
10 Name 31 A 1 1
19 Name 22 B 1 1
28 Name 21 B 1 1
37 Name 23 A 2 1
46 Name 46 B 2 1
$`2`
Name Var1 Var2 grp
2 Name 4 A 1 2
11 Name 51 A 1 2
20 Name 17 B 1 2
29 Name 14 B 1 2
38 Name 37 A 2 2
47 Name 49 B 2 2
$`3`
Name Var1 Var2 grp
3 Name 3 A 1 3
12 Name 13 A 1 3
21 Name 5 B 1 3
30 Name 36 B 1 3
39 Name 11 B 2 3
48 Name 33 B 2 3
$`4`
Name Var1 Var2 grp
4 Name 10 A 1 4
13 Name 15 B 1 4
22 Name 47 B 1 4
31 Name 8 B 1 4
40 Name 42 B 2 4
49 Name 19 B 2 4
$`5`
Name Var1 Var2 grp
5 Name 1 A 1 5
14 Name 7 B 1 5
23 Name 34 B 1 5
32 Name 35 B 1 5
41 Name 26 B 2 5
50 Name 30 B 2 5
$`6`
Name Var1 Var2 grp
6 Name 41 A 1 6
15 Name 50 B 1 6
24 Name 29 B 1 6
33 Name 16 B 1 6
42 Name 48 B 2 6
51 Name 27 B 2 6
$`7`
Name Var1 Var2 grp
7 Name 9 A 1 7
16 Name 28 B 1 7
25 Name 40 B 1 7
34 Name 44 A 2 7
43 Name 45 B 2 7
$`8`
Name Var1 Var2 grp
8 Name 38 A 1 8
17 Name 32 B 1 8
26 Name 39 B 1 8
35 Name 20 A 2 8
44 Name 25 B 2 8
$`9`
Name Var1 Var2 grp
9 Name 24 A 1 9
18 Name 18 B 1 9
27 Name 6 B 1 9
36 Name 2 A 2 9
45 Name 12 B 2 9
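One way to check a split like the one above against the criteria in the question is to tabulate each group. The following is a minimal base R sketch, not part of the original answer: `summarise_groups` is a made-up helper name, and the split is redone without dplyr so the block runs on its own. It assumes a list of data frames with `Var1` and `Var2` columns, as returned by `split()`.

```r
# Hypothetical helper (not from the answer): one summary row per group.
summarise_groups <- function(groups) {
  rows <- lapply(names(groups), function(g) {
    d <- groups[[g]]
    data.frame(
      grp      = g,
      size     = nrow(d),
      n_A      = sum(d$Var1 == "A"),
      n_Var2_1 = sum(d$Var2 == 1),
      n_Var2_2 = sum(d$Var2 == 2)
    )
  })
  do.call(rbind, rows)
}

# Example data from the question.
set.seed(12345)
people <- data.frame(Name = paste("Name", 1:51),
                     Var1 = sample(c("A", "B"), 51, prob = c(0.3, 0.7), replace = TRUE),
                     Var2 = sample(1:2, 51, replace = TRUE))

# Base R version of the same split: shuffle, sort, seed each group with one
# of the first n rows, then deal out the rest in Var2 order.
n <- 9
shuffled  <- people[sample(nrow(people)), ]
sorted    <- shuffled[order(shuffled$Var1, shuffled$Var2), ]
first     <- head(sorted, n)
first$grp <- seq_len(n)
rest      <- tail(sorted, -n)
rest      <- rest[order(rest$Var2), ]
rest$grp  <- rep(seq_len(n), length.out = nrow(rest))
assigned  <- rbind(first, rest)
groups    <- split(assigned, assigned$grp)

chk <- summarise_groups(groups)
print(chk)
```

The `size` column shows whether every group has 5 or 6 people, `n_A` whether each group has at least one `Var1 == "A"`, and the last two columns how `Var2` is distributed. The exact rows drawn depend on the random number generator, so the members of each group may differ from the listing above.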
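Here is an update that may fit your goal better. It uses a water-filling-style algorithm: the Var1 == "A" rows are spread over the groups first, and then the remaining rows are sampled one at a time into whichever group is currently smallest, preferring the Var2 value that group has fewer of.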
ngrp <- 9
dfa <- subset(people, Var1 == "A")
dfb <- subset(people, Var1 == "B")
# Spread the "A" rows over the groups, separately within each Var2 value
dfa_gr <- transform(
  dfa,
  grp = ave(Var2, Var2, FUN = function(x) {
    sample(
      rep(seq(ngrp),
        length.out = length(x)
      ), length(x)
    )
  })
)
lst <- split(subset(dfa_gr, select = -grp), dfa_gr$grp)
# Water-filling: keep topping up the smallest group with a "B" row
while (nrow(dfb) > 0) {
  k <- which.min(sapply(lst, nrow))
  # Var2 value that group k currently has fewer of
  tofill <- c(1:2)[which.min(table(factor(lst[[k]]$Var2, levels = 1:2)))]
  vb <- subset(dfb, Var2 == tofill)
  if (nrow(vb) > 0) {
    rm <- sample(row.names(vb), 1)
  } else {
    # No "B" row left with that Var2 value, so take any remaining row
    rm <- sample(row.names(dfb), 1)
  }
  lst[[k]] <- rbind(lst[[k]], dfb[rm, ])
  dfb <- dfb[row.names(dfb) != rm, ]
}
which gives
> lst
$`1`
Name Var1 Var2
2 Name 2 A 2
9 Name 9 A 1
51 Name 51 A 1
49 Name 49 B 2
21 Name 21 B 1
29 Name 29 B 1
$`2`
Name Var1 Var2
1 Name 1 A 1
13 Name 13 A 1
20 Name 20 A 2
19 Name 19 B 2
14 Name 14 B 1
32 Name 32 B 1
$`3`
Name Var1 Var2
10 Name 10 A 1
43 Name 43 A 1
44 Name 44 A 2
30 Name 30 B 2
36 Name 36 B 1
34 Name 34 B 1
$`4`
Name Var1 Var2
3 Name 3 A 1
23 Name 23 A 2
7 Name 7 B 1
45 Name 45 B 2
6 Name 6 B 1
17 Name 17 B 1
$`5`
Name Var1 Var2
37 Name 37 A 2
41 Name 41 A 1
40 Name 40 B 1
25 Name 25 B 2
16 Name 16 B 1
22 Name 22 B 1
$`6`
Name Var1 Var2
31 Name 31 A 1
42 Name 42 B 2
15 Name 15 B 1
48 Name 48 B 2
35 Name 35 B 1
8 Name 8 B 1
$`7`
Name Var1 Var2
24 Name 24 A 1
27 Name 27 B 2
28 Name 28 B 1
46 Name 46 B 2
50 Name 50 B 1
$`8`
Name Var1 Var2
38 Name 38 A 1
33 Name 33 B 2
18 Name 18 B 1
26 Name 26 B 2
39 Name 39 B 1
$`9`
Name Var1 Var2
4 Name 4 A 1
11 Name 11 B 2
47 Name 47 B 1
12 Name 12 B 2
5 Name 5 B 1
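The step in the loop that picks which Var2 value a group is short of relies on `factor(..., levels = 1:2)`. The small illustration below is not from the original answer (the vector `var2_in_group` is made up); it shows why the fixed levels matter when a group holds only one of the two values.

```r
# A made-up group that currently holds only Var2 == 1 rows.
var2_in_group <- c(1, 1, 1)

# Without fixed levels, table() has no entry for the missing value 2,
# so which.min() can only return the value that is present.
without_levels <- names(which.min(table(var2_in_group)))
print(without_levels)  # "1"

# With fixed levels the missing value is counted as 0 and gets picked.
counts <- table(factor(var2_in_group, levels = 1:2))
tofill <- c(1:2)[which.min(counts)]
print(tofill)          # 2
```

When both values are equally frequent, `which.min()` returns the first minimum, so ties go to Var2 == 1.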
Here is an attempt to group the rows randomly so that each group has at least one Var1 == "A" and the group sizes are close to each other. However, I didn't understand this objective:
"roughly equal balance between 1 and 2 for Var2"
You have uneven numbers of 1 and 2, so it seems difficult to distribute them evenly. Could you explain a bit more?
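For reference, the arithmetic behind that remark can be written out from the counts in the question's table. This sketch assumes one possible reading of "balance", namely that each group mirrors the overall Var2 ratio; that reading is an assumption, and the variable names are illustrative only.

```r
# Counts of Var2 taken from the table in the question (12 + 21 and 5 + 13).
var2_counts <- c("1" = 33, "2" = 18)
ngrp <- 9

per_group  <- var2_counts / ngrp    # ideal (fractional) share per group
base_share <- floor(per_group)      # whole rows every group can get
leftover   <- var2_counts %% ngrp   # rows left over, at most one per group

print(round(per_group, 2))  # 3.67 and 2
print(base_share)           # 3 and 2
print(leftover)             # 6 and 0
```

Under that reading, every group would get 3 rows with Var2 == 1 and 2 rows with Var2 == 2, and six of the nine groups would get one extra Var2 == 1 row, which matches the 5-or-6 group sizes mentioned in the question.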
Below is one option, maybe close to your goal:
ngrp <- 9
z <- do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      with(people, split(people, Var1)),
      function(v) {
        v <- v[order(v$Var2), ]
        transform(
          v,
          grp = sample(
            rep(seq(ngrp),
              length.out = nrow(v)
            ), nrow(v)
          )
        )
      }
    )
  )
)
res <- with(z, split(z, grp))
which gives
> res
$`1`
Name Var1 Var2 grp
5 Name 10 A 1 1
7 Name 24 A 1 1
18 Name 5 B 1 1
19 Name 6 B 1 1
28 Name 22 B 1 1
48 Name 45 B 2 1
$`2`
Name Var1 Var2 grp
1 Name 1 A 1 2
12 Name 51 A 1 2
32 Name 34 B 1 2
34 Name 36 B 1 2
39 Name 11 B 2 2
41 Name 19 B 2 2
$`3`
Name Var1 Var2 grp
6 Name 13 A 1 3
17 Name 44 A 2 3
25 Name 17 B 1 3
29 Name 28 B 1 3
33 Name 35 B 1 3
49 Name 46 B 2 3
$`4`
Name Var1 Var2 grp
2 Name 3 A 1 4
14 Name 20 A 2 4
22 Name 14 B 1 4
35 Name 39 B 1 4
50 Name 48 B 2 4
51 Name 49 B 2 4
$`5`
Name Var1 Var2 grp
9 Name 38 A 1 5
15 Name 23 A 2 5
23 Name 15 B 1 5
27 Name 21 B 1 5
43 Name 26 B 2 5
44 Name 27 B 2 5
$`6`
Name Var1 Var2 grp
13 Name 2 A 2 6
16 Name 37 A 2 6
31 Name 32 B 1 6
37 Name 47 B 1 6
40 Name 12 B 2 6
47 Name 42 B 2 6
$`7`
Name Var1 Var2 grp
8 Name 31 A 1 7
10 Name 41 A 1 7
20 Name 7 B 1 7
30 Name 29 B 1 7
45 Name 30 B 2 7
46 Name 33 B 2 7
$`8`
Name Var1 Var2 grp
4 Name 9 A 1 8
11 Name 43 A 1 8
24 Name 16 B 1 8
38 Name 50 B 1 8
42 Name 25 B 2 8
$`9`
Name Var1 Var2 grp
3 Name 4 A 1 9
21 Name 8 B 1 9
26 Name 18 B 1 9
36 Name 40 B 1 9