Create a random subsample by ID and with a certain factor distribution in R

Tags:

I am working with R and have the following dataset which consists of sentences taken out of books and contains data about the book id, their cover colour (colour), and a sentence ID which is matched with the corresponding book.

My dataset
    Book ID| sentence ID| Colour      | Sentences
    1      | 1          | Blue        | Text goes here
    1      | 2          | Blue        | Text goes here
    1      | 3          | Blue        | Text goes here
    2      | 4          | Red         | Text goes here
    2      | 5          | Red         | Text goes here
    3      | 6          | Green       | Text goes here
    4      | 7          | Orange      | Text goes here
    4      | 8          | Orange      | Text goes here
    4      | 9          | Orange      | Text goes here
    4      | 10         | Orange      | Text goes here
    4      | 11         | Orange      | Text goes here
    5      | 12         | Blue        | Text goes here
    5      | 13         | Blue        | Text goes here
    6      | 14         | Red         | Text goes here
    6      | 15         | Red         | Text goes here
    .

I would like to take four randomized subsamples (each containing 25% of the original data) with following conditions:

1) the distribution of book-colours should remain the same as in the original dataset. If there were 10% blue books, this should also be reflected in the subsamples

2) the subsample should not be taken/split by number of rows (which is the sentence ID) but by "Book ID". This means if Book ID 4 is sampled, then all sentences 7,8,9,10,11 should be in the sample dataset.

3) Also, each Book ID should only be in one of the 4 sub samples - this means if I decided to merge all 4 subsamples, I want to end up with the original dataset again.

What would be the best solution to split my dataset in the way described above?

872

asked Jun 15 '20 11:06

lole_emily

2 Answers

Here the short version:

library(tidyverse)

df <- tribble(
    ~Book_ID, ~sentence_ID, ~Colour, ~Sentences
    ,1      , 1, "Blue", "Text goes here"
    ,1      , 2, "Blue", "Text goes here"
    ,1      , 3, "Blue", "Text goes here"
    ,2      , 4, "Red", "Text goes here"
    ,2      , 5, "Red", "Text goes here"
    ,3      , 6, "Green", "Text goes here"
    ,4      , 7, "Orange", "Text goes here"
    ,4      , 8, "Orange", "Text goes here"
    ,4      , 9, "Orange", "Text goes here"
    ,4      , 10, "Orange", "Text goes here"
    ,4      , 11, "Orange", "Text goes here"
    ,5      , 12, "Blue", "Text goes here"
    ,5      , 13, "Blue", "Text goes here"
    ,6      , 14, "Red", "Text goes here"
    ,6      , 15, "Red", "Text goes here"
)

df %>%
    left_join(
        df %>%
            distinct(Book_ID, Colour) %>%
            group_by(Colour) %>%
            mutate(sub_sample = sample.int(4, size = n(), replace = TRUE))
        , by = c("Book_ID", "Colour"))

This will give you:

# A tibble: 15 x 5
   Book_ID sentence_ID Colour Sentences        sub_sample
     <dbl>       <dbl> <chr>  <chr>                 <int>
 1       1           1 Blue   "Text goes here"          3
 2       1           2 Blue   "Text goes here"          3
 3       1           3 Blue   "Text goes here"          3
 4       2           4 Red    "Text goes here"          1
 5       2           5 Red    "Text goes here"          1
 6       3           6 Green  "Text goes here"          1
 7       4           7 Orange "Text goes here"          2
 8       4           8 Orange "Text goes here"          2
 9       4           9 Orange "Text goes here"          2
10       4          10 Orange "Text goes here"          2
11       4          11 Orange "Text goes here"          2
12       5          12 Blue   "Text goes here"          2
13       5          13 Blue   "Text goes here"          2
14       6          14 Red    "Text goes here"          3
15       6          15 Red    "Text goes here"          3

And a short explanation of the code:

Let's start with the nested part

# take the dataframe
df %>%
    # ...and extract the distinct combinations of book and colour
    distinct(Book_ID, Colour) %>%
    # and now for each colour...
    group_by(Colour) %>%
    # ...provide random numbers from 1 to 4
    mutate(sub_sample = sample.int(4, size = n(), replace = TRUE))

Grouping by colour ensures that you have the same distribution of colours in each sample.

The result of this is now left_joined to the original dataframe on the two columns we "distincted" before - which ensures that there can be no duplicates.

One addition

To have the same colour distribution in the subsamples you of course need a sufficient number of books for each colour. So, for example, only 20 different books in green is guaranteed to be differently distributed. In that case you would probably want to "group" colours for the sampling. However, that's a statistics question and clearly beyond the scope of a programming forum.

answered Oct 13 '22 06:10

Georgery

This should work. Books are grouped by color, and then a number from 1:4 is drawn from a pool that has length: next multiple of 4, to ensure equal distribution. The data frame is then split by sample number.

library(readr)
library(dplyr)
library(tidyr)

books <- read_delim(
'Book ID| sentence ID| Colour      | Sentences
    1      | 1          | Blue        | Text goes here
    1      | 2          | Blue        | Text goes here
    1      | 3          | Blue        | Text goes here
    2      | 4          | Red         | Text goes here
    2      | 5          | Red         | Text goes here
    3      | 6          | Green       | Text goes here
    4      | 7          | Orange      | Text goes here
    4      | 8          | Orange      | Text goes here
    4      | 9          | Orange      | Text goes here
    4      | 10         | Orange      | Text goes here
    4      | 11         | Orange      | Text goes here
    5      | 12         | Blue        | Text goes here
    5      | 13         | Blue        | Text goes here
    6      | 14         | Red         | Text goes here
    6      | 15         | Red         | Text goes here', 
'|', trim_ws = TRUE)

books %>%
  # sampling is done on a book ID level. We group by book 
  # and nest the sentences, to get only one row per book.
  group_by(`Book ID`) %>% 
  nest(book_data = c(`sentence ID`, Sentences)) %>% 

  # We want to split colours evenly. We therefore draw a sample number from 1:4
  # for each group of colours. To ensure an even split, we draw from a 
  # vector that is a repeat of 1:4 until it has a lenght, that is the 
  # first multiple of 4, that is >= the number of colours in a group.
  group_by(Colour) %>%
  mutate(sample = sample(rep_len(1:4, (n() + 3) %/% 4 * 4 ), n(), replace = F)) %>% 

  # Unnest the sentences again.
  unnest(book_data) %>% 

  # Split the data frame into lists by the sample number.
  split(.$sample)

$`1`
# A tibble: 4 x 5
# Groups:   Colour [2]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         5 Blue              12 Text goes here      1
2         5 Blue              13 Text goes here      1
3         6 Red               14 Text goes here      1
4         6 Red               15 Text goes here      1

$`2`
# A tibble: 2 x 5
# Groups:   Colour [1]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         2 Red                4 Text goes here      2
2         2 Red                5 Text goes here      2

$`3`
# A tibble: 1 x 5
# Groups:   Colour [1]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         3 Green              6 Text goes here      3

$`4`
# A tibble: 8 x 5
# Groups:   Colour [2]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         1 Blue               1 Text goes here      4
2         1 Blue               2 Text goes here      4
3         1 Blue               3 Text goes here      4
4         4 Orange             7 Text goes here      4
5         4 Orange             8 Text goes here      4
6         4 Orange             9 Text goes here      4
7         4 Orange            10 Text goes here      4
8         4 Orange            11 Text goes here      4

answered Oct 13 '22 07:10

JohannesNE

Related questions
                            
                                Splitting strings into number and string (with missings)
                            
                                fitting first order equation with nlme and lsoda
                            
                                How can I make this histogram in ggplot / R?
                            
                                compact/efficient replacement for diag(X V X^T)?
                            
                                Unexpected behaviour when piping to ls() in R?
                            
                                left_join for tbl: na_matches not working
                            
                                numbering characters in a string
                            
                                Port JavaScript async and igraph code to R?
                            
                                Is there an "unfilter" in dplyr to merge changes with the original dataset?
                            
                                ggplot - geom_segment() with mutiple arrows
                            
                                List of vectors slower in Julia than R?
                            
                                Set environment variable in bash in Rmarkdown
                            
                                Is there a way to add an alpha value within geom_image in ggplot?
                            
                                Add index to sortable object text in R Sortable
                            
                                Show all used references when using compact numeric citation style
                            
                                Selecting and modifying rows in a dataframe based on criteria
                            
                                Emi calculation for the grace month in R
                            
                                How to restructure dataframe having multiple variable by keeping some variable static & dynamic in R
                            
                                How to get rid of axis flickering caused by coord_flip when using gganimate and view_follow and geom_tile?
                            
                                Replacing group_by_at(NULL) using across

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Create a random subsample by ID and with a certain factor distribution in R

Tags:

merge

random

r

subset

sampling

lole_emily

People also ask

2 Answers

Georgery

JohannesNE

Recent Activity

Donate For Us