
Create a subset that is balanced across multiple variables

Tags: r, subset

To illustrate my question, a dummy example: I have a data set with 16 rows (these represent trials) and 3 columns (trial difficulty, label X, and label Y). Label X is a factor with 4 levels (1–4), and label Y is a factor with 2 levels ("female", "male"). For example:

        difficulty    X    Y
trial1   3.0           1    male
trial2   1.4           1    male
trial3   2.1           1    female
trial4   1.5           1    female
trial5   0.3           2    male
trial6   1.2           2    male
trial7   3.0           2    female
trial8   1.6           2    female
trial9   0.8           3    male
trial10  1.4           3    male
trial11  2.8           3    female
trial12  1.5           3    female
trial13  0.3           4    male
trial14  1.2           4    male
trial15  3.0           4    female
trial16  1.6           4    female

I should like to create a subset of 8 trials from the total of 16 trials; a subset that should adhere to the following criteria:

  1. there is an equal number of trials within the four levels of label X
  2. there is an equal number of trials within the two levels of label Y (and there should also be an equal number of trials for each level of label Y within the four levels of label X)
  3. the trial difficulty variable (numeric, ranging from 0 to 3) should be as close as possible to 1.5

For this dummy example, the ideal subset would be:

        difficulty    X    Y
trial2   1.4           1    male
trial4   1.5           1    female
trial6   1.2           2    male
trial8   1.6           2    female
trial10  1.4           3    male
trial12  1.5           3    female
trial14  1.2           4    male
trial16  1.6           4    female

This subset has 2 trials per level of X, and an equal number of females and males for each level of X, while all trials have a difficulty value that is as close as possible to 1.5.

So far I have tried many nested while and if loops, but I am not sure how to check two variables at the same time (at the moment I'm looping until X is fulfilled, then looping until Y is fulfilled, then looping until X is fulfilled again, etc.). Is this the right approach, or is there a more sensible way of doing this?

asked Oct 21 '22 08:10 by rvrvrv


1 Answer

The following code assumes your data frame is called dat. The code adds a new variable difficulty.scaled equal to the deviation of difficulty from 1.5, then groups the data by values of X and Y, and then selects the observations within each group with absolute value of difficulty.scaled closest to 0 (i.e., difficulty closest to 1.5).

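You can adjust the probs argument to the quantile function to select whatever percentage of each subgroup you want. In this case, I've selected 50% of the rows in each subgroup (that is, 50% of the rows representing each combination of X and Y). Note that the comparison is strict (<), so rows tied exactly at the quantile are dropped; if both rows of a two-row subgroup are equally far from 1.5, neither is selected.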

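library(dplyr)  # Install the dplyr package if you don't already have it

# %>% replaces the older %.% chaining operator, which current dplyr no longer provides
dat2 = dat %>%
         mutate(difficulty.scaled = difficulty - 1.5) %>%  # deviation from the target of 1.5
         group_by(X, Y) %>%                                # one group per X-by-Y combination
         filter(abs(difficulty.scaled) < quantile(abs(difficulty.scaled), .5))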

For the data you pasted in above (where I've converted the trial number to a variable), here's the output:

     tnum difficulty X      Y difficulty.scaled
1  trial2        1.4 1   male              -0.1
2  trial4        1.5 1 female               0.0
3  trial6        1.2 2   male              -0.3
4  trial8        1.6 2 female               0.1
5 trial10        1.4 3   male              -0.1
6 trial12        1.5 3 female               0.0
7 trial14        1.2 4   male              -0.3
8 trial16        1.6 4 female               0.1
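
To reproduce the output above, here is a minimal sketch that enters the question's 16 trials by hand and then checks the balance of the result. The column name tnum follows the output shown above; the balance check with count() is my addition and not part of the original code.

```r
library(dplyr)

# Example data from the question; tnum holds the trial label as a column.
dat <- data.frame(
  tnum = paste0("trial", 1:16),
  difficulty = c(3.0, 1.4, 2.1, 1.5, 0.3, 1.2, 3.0, 1.6,
                 0.8, 1.4, 2.8, 1.5, 0.3, 1.2, 3.0, 1.6),
  X = factor(rep(1:4, each = 4)),
  Y = factor(rep(rep(c("male", "female"), each = 2), times = 4))
)

# Keep, within each X-by-Y cell, the rows whose distance from 1.5
# is below the cell's median distance.
dat2 <- dat %>%
  mutate(difficulty.scaled = difficulty - 1.5) %>%
  group_by(X, Y) %>%
  filter(abs(difficulty.scaled) < quantile(abs(difficulty.scaled), 0.5))

# Balance check: every X-by-Y cell should contain the same number of rows.
balance <- count(dat2, X, Y)
print(balance)
```

If every value in the n column of balance is the same, the subset satisfies criteria 1 and 2 from the question.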

The data you provided has equal numbers of observations for each combination of X and Y. If your real data are unbalanced on these variables, then instead of selecting a percentage of the rows in each sub-group, you can select a specific number of rows. The code below selects the n rows with the lowest absolute value of difficulty.scaled in each sub-group. That way your subset will be balanced even if your full data set is not (as long as you have at least n rows of data for each combination of X and Y).

n = 1  # number of rows to keep per X-by-Y combination
dat2 = dat %>%
         mutate(difficulty.scaled = difficulty - 1.5) %>%
         group_by(X, Y) %>%
         filter(rank(abs(difficulty.scaled), ties.method="first") <= n)

ties.method="first" ensures that exactly n rows will be returned, even if there is more than one row with the same absolute value of difficulty.scaled.
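
As a possible alternative in dplyr 1.0.0 or later, slice_min() with with_ties = FALSE should give the same kind of fixed-size selection. The sketch below uses a small made-up unbalanced data set (the values are hypothetical, not from the question) in which two rows of one cell are tied at a distance of 0.1 from 1.5.

```r
library(dplyr)

# Hypothetical unbalanced data: cell sizes differ across X-by-Y combinations.
dat <- data.frame(
  tnum = paste0("trial", 1:7),
  difficulty = c(3.0, 1.4, 1.6, 2.1, 1.5, 0.3, 1.2),
  X = factor(c(1, 1, 1, 1, 1, 2, 2)),
  Y = factor(c("male", "male", "male", "female", "female", "male", "female"))
)

n <- 1  # rows to keep per X-by-Y cell

# with_ties = FALSE caps each cell at exactly n rows, even when
# several rows are equally close to 1.5.
dat2 <- dat %>%
  group_by(X, Y) %>%
  slice_min(abs(difficulty - 1.5), n = n, with_ties = FALSE)

print(dat2)
```

As with the rank() version, this only yields a balanced subset if every cell has at least n rows to begin with.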

Update: How to divide subsetted data into training and test sets.

Assuming dat2 is your balanced subset, you can divide it into training and test subsets as follows:

# Note that you need to use %>% instead of %.%
train = dat2 %>%
  do(sample_n(., 10)) 

This will return 10 randomly sampled rows per sub-group. Set this value to whatever number of rows per sub-group you want in your training sample; each sub-group must contain at least that many rows, because sample_n samples without replacement by default. Notice that you don't need to group by X and Y again to create the training sample: when you created dat2, dplyr added grouping attributes to it, and dplyr continues to recognize them. Run str(dat2) to see this.

do is a generic function that lets you perform arbitrary operations on a data frame from within dplyr. The period . is a kind of "pronoun" that stands for the data passed along the chain (dat2 in this case, processed one group at a time). This works only with %>%, not with %.%. (%.% was dplyr's original chaining operator; it was deprecated in favor of %>% and later removed, so use %>% throughout.)

# The test set then includes all rows that are not part of train. 
# Since tnum has a unique value for each row, use tnum to select all rows that 
# are not part of train.
test = dat2[!(dat2$tnum %in% train$tnum), ]
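
In newer dplyr versions (1.0.0 or later), the same split can probably be written with slice_sample() and anti_join(). This is a sketch, not the original answer's method: the data are made up (4 rows per cell, with hypothetical difficulty values), and the training size of 2 rows per cell is an arbitrary choice.

```r
library(dplyr)

set.seed(1)  # make the random split reproducible

# Hypothetical balanced subset: 4 rows in each of the 8 X-by-Y cells.
dat2 <- expand.grid(rep = 1:4,
                    X = factor(1:4),
                    Y = factor(c("female", "male"))) %>%
  mutate(tnum = paste0("trial", row_number()),
         difficulty = round(runif(n(), 0, 3), 1)) %>%
  group_by(X, Y)

# Draw 2 rows per cell for training (sampling respects the grouping).
train <- dat2 %>% slice_sample(n = 2)

# The test set is every row whose tnum does not appear in train.
test <- anti_join(dat2, train, by = "tnum")
```

Because the sampling is done per cell, both train and test stay balanced on X and Y as long as dat2 is balanced.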

answered Oct 23 '22 01:10 by eipi10