Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Splitting dataframe depending on conditions

Tags:

split

dataframe

r

Let's say you have the following data frame

set.seed(12345)
people <- data.frame(Name = paste("Name", 1:51),
                     Var1 = sample(c("A", "B"), 51, prob = c(0.3, 0.7), replace = TRUE),
                     Var2 = sample(1:2, 51, replace = TRUE))
table(people$Var1, people$Var2)

     1  2
  A 12  5
  B 21 13

I would like to split the dataset into groups depending on certain criteria.

For example, I might want to divide the dataset into 9 groups, so that each one has at least 1 person with Var1 == 'A' and a roughly equal balance between 1 and 2 for Var2.

Obviously, an exact split is not possible so, in this example, I would allocate 5 people to each group and then allocate the rest randomly, in order to have either 5 or 6 people in each group.

Is there an efficient way of doing this?

PS: I am asking how to do this in R, as I already have these data in R, but a generic solution would be appreciated as well

like image 630
nico Avatar asked May 19 '26 03:05

nico


2 Answers

A simple approach with dplyr:

  • first order the dataset randomly
  • then order by Var1,Var2
  • select n=9 first people (with Var1 == 'A' because of order)
  • sort the remaining people by Var2 and dispatch them in the groups
library(dplyr)

n <- 9

data <- people[sample(nrow(people),replace=F),] %>% arrange(Var1,Var2)

rbind(head(data, n) %>% mutate(grp = 1:n),
      tail(data,-n) %>% arrange(Var2) %>%
                        mutate(grp = rep(1:n,length.out=nrow(people)-n))
     ) %>% split(.$grp)

$`1`
      Name Var1 Var2 grp
1  Name 43    A    1   1
10 Name 31    A    1   1
19 Name 22    B    1   1
28 Name 21    B    1   1
37 Name 23    A    2   1
46 Name 46    B    2   1

$`2`
      Name Var1 Var2 grp
2   Name 4    A    1   2
11 Name 51    A    1   2
20 Name 17    B    1   2
29 Name 14    B    1   2
38 Name 37    A    2   2
47 Name 49    B    2   2

$`3`
      Name Var1 Var2 grp
3   Name 3    A    1   3
12 Name 13    A    1   3
21  Name 5    B    1   3
30 Name 36    B    1   3
39 Name 11    B    2   3
48 Name 33    B    2   3

$`4`
      Name Var1 Var2 grp
4  Name 10    A    1   4
13 Name 15    B    1   4
22 Name 47    B    1   4
31  Name 8    B    1   4
40 Name 42    B    2   4
49 Name 19    B    2   4

$`5`
      Name Var1 Var2 grp
5   Name 1    A    1   5
14  Name 7    B    1   5
23 Name 34    B    1   5
32 Name 35    B    1   5
41 Name 26    B    2   5
50 Name 30    B    2   5

$`6`
      Name Var1 Var2 grp
6  Name 41    A    1   6
15 Name 50    B    1   6
24 Name 29    B    1   6
33 Name 16    B    1   6
42 Name 48    B    2   6
51 Name 27    B    2   6

$`7`
      Name Var1 Var2 grp
7   Name 9    A    1   7
16 Name 28    B    1   7
25 Name 40    B    1   7
34 Name 44    A    2   7
43 Name 45    B    2   7

$`8`
      Name Var1 Var2 grp
8  Name 38    A    1   8
17 Name 32    B    1   8
26 Name 39    B    1   8
35 Name 20    A    2   8
44 Name 25    B    2   8

$`9`
      Name Var1 Var2 grp
9  Name 24    A    1   9
18 Name 18    B    1   9
27  Name 6    B    1   9
36  Name 2    A    2   9
45 Name 12    B    2   9
like image 51
Waldi Avatar answered May 21 '26 22:05

Waldi


Update

Here is an update which may fit your goal, where an algorithm like water-filling is applied to sample rows dynamically according to updated groups.

ngrp <- 9
dfa <- subset(people, Var1 == "A")
dfb <- subset(people, Var1 == "B")

dfa_gr <- transform(
  dfa,
  grp = ave(Var2, Var2, FUN = function(x) {
    sample(
      rep(seq(ngrp),
        length.out = length(x)
      ), length(x)
    )
  })
)

lst <- split(subset(dfa_gr, select = -grp), dfa_gr$grp)

while (nrow(dfb) > 0) {
  k <- which.min(sapply(lst, nrow))
  tofill <- c(1:2)[which.min(table(factor(lst[[k]]$Var2, levels = 1:2)))]
  vb <- subset(dfb, Var2 == tofill)
  if (nrow(vb) > 0) {
    rm <- sample(row.names(vb), 1)
  } else {
    rm <- sample(row.names(dfb), 1)
  }
  lst[[k]] <- rbind(lst[[k]], dfb[rm, ])
  dfb <- dfb[row.names(dfb) != rm, ]
}

which gives

> lst
$`1`
      Name Var1 Var2
2   Name 2    A    2
9   Name 9    A    1
51 Name 51    A    1
49 Name 49    B    2
21 Name 21    B    1
29 Name 29    B    1

$`2`
      Name Var1 Var2
1   Name 1    A    1
13 Name 13    A    1
20 Name 20    A    2
19 Name 19    B    2
14 Name 14    B    1
32 Name 32    B    1

$`3`
      Name Var1 Var2
10 Name 10    A    1
43 Name 43    A    1
44 Name 44    A    2
30 Name 30    B    2
36 Name 36    B    1
34 Name 34    B    1

$`4`
      Name Var1 Var2
3   Name 3    A    1
23 Name 23    A    2
7   Name 7    B    1
45 Name 45    B    2
6   Name 6    B    1
17 Name 17    B    1

$`5`
      Name Var1 Var2
37 Name 37    A    2
41 Name 41    A    1
40 Name 40    B    1
25 Name 25    B    2
16 Name 16    B    1
22 Name 22    B    1

$`6`
      Name Var1 Var2
31 Name 31    A    1
42 Name 42    B    2
15 Name 15    B    1
48 Name 48    B    2
35 Name 35    B    1
8   Name 8    B    1

$`7`
      Name Var1 Var2
24 Name 24    A    1
27 Name 27    B    2
28 Name 28    B    1
46 Name 46    B    2
50 Name 50    B    1

$`8`
      Name Var1 Var2
38 Name 38    A    1
33 Name 33    B    2
18 Name 18    B    1
26 Name 26    B    2
39 Name 39    B    1

$`9`
      Name Var1 Var2
4   Name 4    A    1
11 Name 11    B    2
47 Name 47    B    1
12 Name 12    B    2
5   Name 5    B    1

Here is an attempt to group rows randomly, which has at least one Var1==A in each group and tries to have close size among groups. However, I didn't get the meaning of this objective:

roughly equal balance between 1 and 2 for Var2

You have uneven numbers of 1 and 2 so it seems difficulty to have even distribution of them. Or, could you explain a bit on that?


Below is one option, maybe close to your goal:

ngrp <- 9
z <- do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      with(people, split(people, Var1)),
      function(v) {
        v <- v[order(v$Var2), ]
        transform(
          v,
          grp = sample(
            rep(seq(ngrp),
              length.out = nrow(v)
            ), nrow(v)
          )
        )
      }
    )
  )
)
res <- with(z, split(z, grp))

which gives

> res
$`1`
      Name Var1 Var2 grp
5  Name 10    A    1   1
7  Name 24    A    1   1
18  Name 5    B    1   1
19  Name 6    B    1   1
28 Name 22    B    1   1
48 Name 45    B    2   1

$`2`
      Name Var1 Var2 grp
1   Name 1    A    1   2
12 Name 51    A    1   2
32 Name 34    B    1   2
34 Name 36    B    1   2
39 Name 11    B    2   2
41 Name 19    B    2   2

$`3`
      Name Var1 Var2 grp
6  Name 13    A    1   3
17 Name 44    A    2   3
25 Name 17    B    1   3
29 Name 28    B    1   3
33 Name 35    B    1   3
49 Name 46    B    2   3

$`4`
      Name Var1 Var2 grp
2   Name 3    A    1   4
14 Name 20    A    2   4
22 Name 14    B    1   4
35 Name 39    B    1   4
50 Name 48    B    2   4
51 Name 49    B    2   4

$`5`
      Name Var1 Var2 grp
9  Name 38    A    1   5
15 Name 23    A    2   5
23 Name 15    B    1   5
27 Name 21    B    1   5
43 Name 26    B    2   5
44 Name 27    B    2   5

$`6`
      Name Var1 Var2 grp
13  Name 2    A    2   6
16 Name 37    A    2   6
31 Name 32    B    1   6
37 Name 47    B    1   6
40 Name 12    B    2   6
47 Name 42    B    2   6

$`7`
      Name Var1 Var2 grp
8  Name 31    A    1   7
10 Name 41    A    1   7
20  Name 7    B    1   7
30 Name 29    B    1   7
45 Name 30    B    2   7
46 Name 33    B    2   7

$`8`
      Name Var1 Var2 grp
4   Name 9    A    1   8
11 Name 43    A    1   8
24 Name 16    B    1   8
38 Name 50    B    1   8
42 Name 25    B    2   8

$`9`
      Name Var1 Var2 grp
3   Name 4    A    1   9
21  Name 8    B    1   9
26 Name 18    B    1   9
36 Name 40    B    1   9
like image 39
ThomasIsCoding Avatar answered May 21 '26 20:05

ThomasIsCoding