Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create a random subsample by ID and with a certain factor distribution in R

I am working with R and have the following dataset which consists of sentences taken out of books and contains data about the book id, their cover colour (colour), and a sentence ID which is matched with the corresponding book.

My dataset
    Book ID| sentence ID| Colour      | Sentences
    1      | 1          | Blue        | Text goes here
    1      | 2          | Blue        | Text goes here
    1      | 3          | Blue        | Text goes here
    2      | 4          | Red         | Text goes here
    2      | 5          | Red         | Text goes here
    3      | 6          | Green       | Text goes here
    4      | 7          | Orange      | Text goes here
    4      | 8          | Orange      | Text goes here
    4      | 9          | Orange      | Text goes here
    4      | 10         | Orange      | Text goes here
    4      | 11         | Orange      | Text goes here
    5      | 12         | Blue        | Text goes here
    5      | 13         | Blue        | Text goes here
    6      | 14         | Red         | Text goes here
    6      | 15         | Red         | Text goes here
    .

I would like to take four randomized subsamples (each containing 25% of the original data) with following conditions:

1) the distribution of book-colours should remain the same as in the original dataset. If there were 10% blue books, this should also be reflected in the subsamples

2) the subsample should not be taken/split by number of rows (which is the sentence ID) but by "Book ID". This means if Book ID 4 is sampled, then all sentences 7,8,9,10,11 should be in the sample dataset.

3) Also, each Book ID should only be in one of the 4 sub samples - this means if I decided to merge all 4 subsamples, I want to end up with the original dataset again.

What would be the best solution to split my dataset in the way described above?

like image 872
lole_emily Avatar asked Jun 15 '20 11:06

lole_emily


People also ask

How to create random samples in R?

To do this, use the set. seed() function. Using set. seed() will force R to produce consistent random samples at any time on any computer.

How does sample function work in R?

The sample() function in R allows you to take a random sample of elements from a dataset or a vector, either with or without replacement. The complete documentation for sample() can be found here. The following examples illustrate practical examples of using sample().


2 Answers

Here the short version:

library(tidyverse)

df <- tribble(
    ~Book_ID, ~sentence_ID, ~Colour, ~Sentences
    ,1      , 1, "Blue", "Text goes here"
    ,1      , 2, "Blue", "Text goes here"
    ,1      , 3, "Blue", "Text goes here"
    ,2      , 4, "Red", "Text goes here"
    ,2      , 5, "Red", "Text goes here"
    ,3      , 6, "Green", "Text goes here"
    ,4      , 7, "Orange", "Text goes here"
    ,4      , 8, "Orange", "Text goes here"
    ,4      , 9, "Orange", "Text goes here"
    ,4      , 10, "Orange", "Text goes here"
    ,4      , 11, "Orange", "Text goes here"
    ,5      , 12, "Blue", "Text goes here"
    ,5      , 13, "Blue", "Text goes here"
    ,6      , 14, "Red", "Text goes here"
    ,6      , 15, "Red", "Text goes here"
)

df %>%
    left_join(
        df %>%
            distinct(Book_ID, Colour) %>%
            group_by(Colour) %>%
            mutate(sub_sample = sample.int(4, size = n(), replace = TRUE))
        , by = c("Book_ID", "Colour"))

This will give you:

# A tibble: 15 x 5
   Book_ID sentence_ID Colour Sentences        sub_sample
     <dbl>       <dbl> <chr>  <chr>                 <int>
 1       1           1 Blue   "Text goes here"          3
 2       1           2 Blue   "Text goes here"          3
 3       1           3 Blue   "Text goes here"          3
 4       2           4 Red    "Text goes here"          1
 5       2           5 Red    "Text goes here"          1
 6       3           6 Green  "Text goes here"          1
 7       4           7 Orange "Text goes here"          2
 8       4           8 Orange "Text goes here"          2
 9       4           9 Orange "Text goes here"          2
10       4          10 Orange "Text goes here"          2
11       4          11 Orange "Text goes here"          2
12       5          12 Blue   "Text goes here"          2
13       5          13 Blue   "Text goes here"          2
14       6          14 Red    "Text goes here"          3
15       6          15 Red    "Text goes here"          3

And a short explanation of the code:

Let's start with the nested part

# take the dataframe
df %>%
    # ...and extract the distinct combinations of book and colour
    distinct(Book_ID, Colour) %>%
    # and now for each colour...
    group_by(Colour) %>%
    # ...provide random numbers from 1 to 4
    mutate(sub_sample = sample.int(4, size = n(), replace = TRUE))

Grouping by colour ensures that you have the same distribution of colours in each sample.

The result of this is now left_joined to the original dataframe on the two columns we "distincted" before - which ensures that there can be no duplicates.


One addition

To have the same colour distribution in the subsamples you of course need a sufficient number of books for each colour. So, for example, only 20 different books in green is guaranteed to be differently distributed. In that case you would probably want to "group" colours for the sampling. However, that's a statistics question and clearly beyond the scope of a programming forum.

like image 67
Georgery Avatar answered Oct 13 '22 06:10

Georgery


This should work. Books are grouped by color, and then a number from 1:4 is drawn from a pool that has length: next multiple of 4, to ensure equal distribution. The data frame is then split by sample number.

library(readr)
library(dplyr)
library(tidyr)

books <- read_delim(
'Book ID| sentence ID| Colour      | Sentences
    1      | 1          | Blue        | Text goes here
    1      | 2          | Blue        | Text goes here
    1      | 3          | Blue        | Text goes here
    2      | 4          | Red         | Text goes here
    2      | 5          | Red         | Text goes here
    3      | 6          | Green       | Text goes here
    4      | 7          | Orange      | Text goes here
    4      | 8          | Orange      | Text goes here
    4      | 9          | Orange      | Text goes here
    4      | 10         | Orange      | Text goes here
    4      | 11         | Orange      | Text goes here
    5      | 12         | Blue        | Text goes here
    5      | 13         | Blue        | Text goes here
    6      | 14         | Red         | Text goes here
    6      | 15         | Red         | Text goes here', 
'|', trim_ws = TRUE)

books %>%
  # sampling is done on a book ID level. We group by book 
  # and nest the sentences, to get only one row per book.
  group_by(`Book ID`) %>% 
  nest(book_data = c(`sentence ID`, Sentences)) %>% 

  # We want to split colours evenly. We therefore draw a sample number from 1:4
  # for each group of colours. To ensure an even split, we draw from a 
  # vector that is a repeat of 1:4 until it has a lenght, that is the 
  # first multiple of 4, that is >= the number of colours in a group.
  group_by(Colour) %>%
  mutate(sample = sample(rep_len(1:4, (n() + 3) %/% 4 * 4 ), n(), replace = F)) %>% 

  # Unnest the sentences again.
  unnest(book_data) %>% 

  # Split the data frame into lists by the sample number.
  split(.$sample) 
$`1`
# A tibble: 4 x 5
# Groups:   Colour [2]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         5 Blue              12 Text goes here      1
2         5 Blue              13 Text goes here      1
3         6 Red               14 Text goes here      1
4         6 Red               15 Text goes here      1

$`2`
# A tibble: 2 x 5
# Groups:   Colour [1]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         2 Red                4 Text goes here      2
2         2 Red                5 Text goes here      2

$`3`
# A tibble: 1 x 5
# Groups:   Colour [1]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         3 Green              6 Text goes here      3

$`4`
# A tibble: 8 x 5
# Groups:   Colour [2]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         1 Blue               1 Text goes here      4
2         1 Blue               2 Text goes here      4
3         1 Blue               3 Text goes here      4
4         4 Orange             7 Text goes here      4
5         4 Orange             8 Text goes here      4
6         4 Orange             9 Text goes here      4
7         4 Orange            10 Text goes here      4
8         4 Orange            11 Text goes here      4

like image 41
JohannesNE Avatar answered Oct 13 '22 07:10

JohannesNE