I am working with R and have the following dataset which consists of sentences taken out of books and contains data about the book id, their cover colour (colour), and a sentence ID which is matched with the corresponding book.
My dataset
Book ID| sentence ID| Colour | Sentences
1 | 1 | Blue | Text goes here
1 | 2 | Blue | Text goes here
1 | 3 | Blue | Text goes here
2 | 4 | Red | Text goes here
2 | 5 | Red | Text goes here
3 | 6 | Green | Text goes here
4 | 7 | Orange | Text goes here
4 | 8 | Orange | Text goes here
4 | 9 | Orange | Text goes here
4 | 10 | Orange | Text goes here
4 | 11 | Orange | Text goes here
5 | 12 | Blue | Text goes here
5 | 13 | Blue | Text goes here
6 | 14 | Red | Text goes here
6 | 15 | Red | Text goes here
.
I would like to take four randomized subsamples (each containing 25% of the original data) with following conditions:
1) the distribution of book-colours should remain the same as in the original dataset. If there were 10% blue books, this should also be reflected in the subsamples
2) the subsample should not be taken/split by number of rows (which is the sentence ID) but by "Book ID". This means if Book ID 4 is sampled, then all sentences 7,8,9,10,11 should be in the sample dataset.
3) Also, each Book ID should only be in one of the 4 sub samples - this means if I decided to merge all 4 subsamples, I want to end up with the original dataset again.
What would be the best solution to split my dataset in the way described above?
To do this, use the set. seed() function. Using set. seed() will force R to produce consistent random samples at any time on any computer.
The sample() function in R allows you to take a random sample of elements from a dataset or a vector, either with or without replacement. The complete documentation for sample() can be found here. The following examples illustrate practical examples of using sample().
Here the short version:
library(tidyverse)
df <- tribble(
~Book_ID, ~sentence_ID, ~Colour, ~Sentences
,1 , 1, "Blue", "Text goes here"
,1 , 2, "Blue", "Text goes here"
,1 , 3, "Blue", "Text goes here"
,2 , 4, "Red", "Text goes here"
,2 , 5, "Red", "Text goes here"
,3 , 6, "Green", "Text goes here"
,4 , 7, "Orange", "Text goes here"
,4 , 8, "Orange", "Text goes here"
,4 , 9, "Orange", "Text goes here"
,4 , 10, "Orange", "Text goes here"
,4 , 11, "Orange", "Text goes here"
,5 , 12, "Blue", "Text goes here"
,5 , 13, "Blue", "Text goes here"
,6 , 14, "Red", "Text goes here"
,6 , 15, "Red", "Text goes here"
)
df %>%
left_join(
df %>%
distinct(Book_ID, Colour) %>%
group_by(Colour) %>%
mutate(sub_sample = sample.int(4, size = n(), replace = TRUE))
, by = c("Book_ID", "Colour"))
This will give you:
# A tibble: 15 x 5
Book_ID sentence_ID Colour Sentences sub_sample
<dbl> <dbl> <chr> <chr> <int>
1 1 1 Blue "Text goes here" 3
2 1 2 Blue "Text goes here" 3
3 1 3 Blue "Text goes here" 3
4 2 4 Red "Text goes here" 1
5 2 5 Red "Text goes here" 1
6 3 6 Green "Text goes here" 1
7 4 7 Orange "Text goes here" 2
8 4 8 Orange "Text goes here" 2
9 4 9 Orange "Text goes here" 2
10 4 10 Orange "Text goes here" 2
11 4 11 Orange "Text goes here" 2
12 5 12 Blue "Text goes here" 2
13 5 13 Blue "Text goes here" 2
14 6 14 Red "Text goes here" 3
15 6 15 Red "Text goes here" 3
And a short explanation of the code:
Let's start with the nested part
# take the dataframe
df %>%
# ...and extract the distinct combinations of book and colour
distinct(Book_ID, Colour) %>%
# and now for each colour...
group_by(Colour) %>%
# ...provide random numbers from 1 to 4
mutate(sub_sample = sample.int(4, size = n(), replace = TRUE))
Grouping by colour ensures that you have the same distribution of colours in each sample.
The result of this is now left_join
ed to the original dataframe on the two columns we "distincted" before - which ensures that there can be no duplicates.
One addition
To have the same colour distribution in the subsamples you of course need a sufficient number of books for each colour. So, for example, only 20 different books in green is guaranteed to be differently distributed. In that case you would probably want to "group" colours for the sampling. However, that's a statistics question and clearly beyond the scope of a programming forum.
This should work. Books are grouped by color, and then a number from 1:4 is drawn from a pool that has length: next multiple of 4, to ensure equal distribution. The data frame is then split by sample number.
library(readr)
library(dplyr)
library(tidyr)
books <- read_delim(
'Book ID| sentence ID| Colour | Sentences
1 | 1 | Blue | Text goes here
1 | 2 | Blue | Text goes here
1 | 3 | Blue | Text goes here
2 | 4 | Red | Text goes here
2 | 5 | Red | Text goes here
3 | 6 | Green | Text goes here
4 | 7 | Orange | Text goes here
4 | 8 | Orange | Text goes here
4 | 9 | Orange | Text goes here
4 | 10 | Orange | Text goes here
4 | 11 | Orange | Text goes here
5 | 12 | Blue | Text goes here
5 | 13 | Blue | Text goes here
6 | 14 | Red | Text goes here
6 | 15 | Red | Text goes here',
'|', trim_ws = TRUE)
books %>%
# sampling is done on a book ID level. We group by book
# and nest the sentences, to get only one row per book.
group_by(`Book ID`) %>%
nest(book_data = c(`sentence ID`, Sentences)) %>%
# We want to split colours evenly. We therefore draw a sample number from 1:4
# for each group of colours. To ensure an even split, we draw from a
# vector that is a repeat of 1:4 until it has a lenght, that is the
# first multiple of 4, that is >= the number of colours in a group.
group_by(Colour) %>%
mutate(sample = sample(rep_len(1:4, (n() + 3) %/% 4 * 4 ), n(), replace = F)) %>%
# Unnest the sentences again.
unnest(book_data) %>%
# Split the data frame into lists by the sample number.
split(.$sample)
$`1`
# A tibble: 4 x 5
# Groups: Colour [2]
`Book ID` Colour `sentence ID` Sentences sample
<dbl> <chr> <dbl> <chr> <int>
1 5 Blue 12 Text goes here 1
2 5 Blue 13 Text goes here 1
3 6 Red 14 Text goes here 1
4 6 Red 15 Text goes here 1
$`2`
# A tibble: 2 x 5
# Groups: Colour [1]
`Book ID` Colour `sentence ID` Sentences sample
<dbl> <chr> <dbl> <chr> <int>
1 2 Red 4 Text goes here 2
2 2 Red 5 Text goes here 2
$`3`
# A tibble: 1 x 5
# Groups: Colour [1]
`Book ID` Colour `sentence ID` Sentences sample
<dbl> <chr> <dbl> <chr> <int>
1 3 Green 6 Text goes here 3
$`4`
# A tibble: 8 x 5
# Groups: Colour [2]
`Book ID` Colour `sentence ID` Sentences sample
<dbl> <chr> <dbl> <chr> <int>
1 1 Blue 1 Text goes here 4
2 1 Blue 2 Text goes here 4
3 1 Blue 3 Text goes here 4
4 4 Orange 7 Text goes here 4
5 4 Orange 8 Text goes here 4
6 4 Orange 9 Text goes here 4
7 4 Orange 10 Text goes here 4
8 4 Orange 11 Text goes here 4
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With