To illustrate my question, a dummy example: I have a data set with 16 rows (these represent trials) and 3 columns (trial difficulty, label X, and label Y). Label X is a factor with 4 levels (1–4), and label Y is a factor with 2 levels ("female", "male"). For example:
difficulty X Y
trial1 3.0 1 male
trial2 1.4 1 male
trial3 2.1 1 female
trial4 1.5 1 female
trial5 0.3 2 male
trial6 1.2 2 male
trial7 3.0 2 female
trial8 1.6 2 female
trial9 0.8 3 male
trial10 1.4 3 male
trial11 2.8 3 female
trial12 1.5 3 female
trial13 0.3 4 male
trial14 1.2 4 male
trial15 3.0 4 female
trial16 1.6 4 female
I would like to create a subset of 8 trials from the total of 16; the subset should adhere to the following criteria: 2 trials per level of X, an equal number of females and males within each level of X, and difficulty values as close as possible to 1.5.
For this dummy example, the ideal subset would be:
difficulty X Y
trial2 1.4 1 male
trial4 1.5 1 female
trial6 1.2 2 male
trial8 1.6 2 female
trial10 1.4 3 male
trial12 1.5 3 female
trial14 1.2 4 male
trial16 1.6 4 female
This subset has 2 trials per level of X, and an equal number of females and males for each level of X, while all trials have a difficulty value that is as close as possible to 1.5.
My attempts have used many nested while and if loops, but I am not sure how to check for two variables at the same time (at the moment I'm looping until X is fulfilled, then looping until Y is fulfilled, then looping until X is fulfilled again, and so on). Would this be the right approach, or is there a more sensible way of doing this?
The following code assumes your data frame is called dat. It adds a new variable, difficulty.scaled, equal to the deviation of difficulty from 1.5, then groups the data by the values of X and Y, and then selects the observations within each group whose absolute value of difficulty.scaled is closest to 0 (i.e., whose difficulty is closest to 1.5).
You can adjust the probs argument to the quantile function to select whatever percentage of each subgroup you want. In this case, I've selected 50% of the rows in each subgroup (that is, 50% of the rows representing each combination of X and Y).
library(dplyr) # Install the dplyr package if you don't already have it
dat2 = dat %>%
  mutate(difficulty.scaled = difficulty - 1.5) %>%
  group_by(X, Y) %>%
  filter(abs(difficulty.scaled) < quantile(abs(difficulty.scaled), .5))
For the data you pasted in above (where I've converted the trial number to a variable), here's the output:
tnum difficulty X Y difficulty.scaled
1 trial2 1.4 1 male -0.1
2 trial4 1.5 1 female 0.0
3 trial6 1.2 2 male -0.3
4 trial8 1.6 2 female 0.1
5 trial10 1.4 3 male -0.1
6 trial12 1.5 3 female 0.0
7 trial14 1.2 4 male -0.3
8 trial16 1.6 4 female 0.1
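As a side note for readers on current dplyr releases (where the old %.% operator has been removed), the quantile-based selection can be written more directly with slice_min() and its prop argument. Here is a self-contained sketch that reconstructs the example data, with the trial number stored in a tnum column as in the output shown here:

```r
library(dplyr)

# Reconstruct the example data; tnum holds the trial number
dat = data.frame(
  tnum = paste0("trial", 1:16),
  difficulty = c(3.0, 1.4, 2.1, 1.5, 0.3, 1.2, 3.0, 1.6,
                 0.8, 1.4, 2.8, 1.5, 0.3, 1.2, 3.0, 1.6),
  X = factor(rep(1:4, each = 4)),
  Y = rep(c("male", "male", "female", "female"), 4)
)

# Within each X/Y combination, keep the 50% of rows whose
# difficulty is closest to 1.5
dat2 = dat %>%
  group_by(X, Y) %>%
  slice_min(abs(difficulty - 1.5), prop = 0.5) %>%
  ungroup()
```

slice_min(prop = 0.5) keeps, within each group, the fraction of rows with the smallest ordering value, so it plays the same role as the quantile(..., .5) filter; pass with_ties = FALSE if ties should never push the count above the requested fraction.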
The data you provided has equal numbers of observations for each combination of X and Y. If your real data are unbalanced on these variables, then instead of selecting a percentage of the rows in each sub-group, you can select a specific number of rows. The code below selects the n rows with the lowest absolute value of difficulty.scaled in each sub-group. That way your subset will be balanced even if your full data set is not (as long as you have at least n rows of data for each combination of X and Y).
n = 1
dat2 = dat %>%
  mutate(difficulty.scaled = difficulty - 1.5) %>%
  group_by(X, Y) %>%
  filter(rank(abs(difficulty.scaled), ties.method = "first") <= n)
ties.method = "first" ensures that exactly n rows will be returned, even if there is more than one row with the same absolute value of difficulty.scaled.
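On current dplyr releases, the rank()/ties.method idiom can likewise be written with slice_min(n = n, with_ties = FALSE), which also guarantees exactly n rows per group. A self-contained sketch, with the example data reconstructed from the question and the trial number stored in an assumed tnum column:

```r
library(dplyr)

# Reconstruct the example data; tnum holds the trial number
dat = data.frame(
  tnum = paste0("trial", 1:16),
  difficulty = c(3.0, 1.4, 2.1, 1.5, 0.3, 1.2, 3.0, 1.6,
                 0.8, 1.4, 2.8, 1.5, 0.3, 1.2, 3.0, 1.6),
  X = factor(rep(1:4, each = 4)),
  Y = rep(c("male", "male", "female", "female"), 4)
)

# Keep exactly n rows per X/Y combination, ordered by how close
# difficulty is to 1.5; with_ties = FALSE breaks ties like
# ties.method = "first" does
n = 1
dat2 = dat %>%
  group_by(X, Y) %>%
  slice_min(abs(difficulty - 1.5), n = n, with_ties = FALSE) %>%
  ungroup()
```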
Update: How to divide subsetted data into training and test sets.
Assuming dat2 is your balanced subset, you can divide it into training and test subsets as follows:
# Note that do() requires the %>% operator rather than the older %.%
train = dat2 %>%
  do(sample_n(., 10))
This will return 10 randomly sampled rows per sub-group. Just set this value to whatever number of rows per sub-group you want in your training sample. Notice that you don't need to group by X and Y to create the training sample: when you created dat2, dplyr added grouping attributes to it that it continues to recognize. Run str(dat2) to see this.
do is a generic function that allows you to perform arbitrary operations on a data frame from within dplyr. The period . acts as a kind of "pronoun" that represents the data frame (dat2 in this case). This works with %>% but not with %.%; dplyr is transitioning from %.% to %>% for chaining operations, so it's probably best to just use %>% from now on.
# The test set then includes all rows that are not part of train.
# Since tnum has a unique value for each row, use tnum to select all rows that
# are not part of train.
test = dat2[!(dat2$tnum %in% train$tnum), ]
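On current dplyr releases, do(sample_n(.)) has been superseded by slice_sample(), and the %in% subsetting can be replaced by anti_join(). A self-contained sketch: the data frame below is a stand-in for dat2, using all 16 example trials so that each X/Y sub-group has more than one row to sample from.

```r
library(dplyr)

# Stand-in for dat2: the full 16-trial example, grouped by X and Y
# (in practice dat2 would be your balanced subset, already grouped)
dat2 = data.frame(
  tnum = paste0("trial", 1:16),
  difficulty = c(3.0, 1.4, 2.1, 1.5, 0.3, 1.2, 3.0, 1.6,
                 0.8, 1.4, 2.8, 1.5, 0.3, 1.2, 3.0, 1.6),
  X = factor(rep(1:4, each = 4)),
  Y = rep(c("male", "male", "female", "female"), 4)
) %>%
  group_by(X, Y)

set.seed(1)                              # make the sampling reproducible
train = dat2 %>% slice_sample(n = 1)     # 1 random row per X/Y sub-group
test  = anti_join(ungroup(dat2), train, by = "tnum")
```

Because tnum uniquely identifies each row, anti_join() keeps exactly those rows of dat2 that were not sampled into train.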