 

Randomly splitting data from a grouped dataset

Tags:

r

The following is a sample of my dataset.

plotID  Rs.ten  Corr.Rs
1   4.7 2.434437263
1   5.4 2.753744943
1   4   2.044908476
1   0   1.19251
1   1.2 1.84929
1   1.7 1.0755
1   2   1.55399
1   4.5 1.45883
1   3   1.12485
1   4.4 1.92245
1   3.6 1.77914
2   -8.0    0.027792795
2   0.2 0.988443802
2   3.5 0.937311439
2   4   1.007496802
2   5.6 1.738293766
2   6.5 1.722974764
2   6.4 1.590481774
2   5.5 1.097063592
2   5.2 1.389683585
2   6.4 1.392490686
2   6.6 1.812855123
2   5   1.42508238
2   0.4 0.90678
2   3.1 1.00162
2   2.7 0.7914
2   5.9 0.81313
2   4.9 0.89668
2   6.3 1.25597
2   4.7 1.03459
3   5   2.265195289
3   5.3 1.655801734
3   4.4 3.593587609
3   4   3.668348047
3   5.2 2.459742028
3   4.3 3.128687638
3   0.7 2.55316
3   3   2.5708
3   2.8 1.34671
3   2.6 1.90105
3   5.6 1.56052
3   4.2 2.26067
3   4.7 2.22488
3   3.7 2.91198

I have 36 groups represented by plotID. I want to split the dataset into training and testing datasets (60/40, respectively) for each group (plotID).

In other words, I need a function that will randomly select 60% of the data from plotID 1, plotID 2, plotID 3, etc. for training and leave the remaining 40% from each plotID for testing. I came close using the following link: Randomly split data by criterion into training and testing data set using R; however, that simply split the entire dataset 60/40 by the total number of groups, not within each group.

Seems like I'm missing something simple here, but I just can't see it.

Thanks in advance for your help.

asked Apr 21 '15 by woodland_creature




2 Answers

You can use the stratified function from my "splitstackshape" package.
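
The examples below assume your sample is in a data frame called mydf; if it isn't in R yet, one way to get it there is to paste the posted table into read.table, roughly like this (only the first few rows are shown here, use the full table from your question):

mydf <- read.table(text = "
plotID  Rs.ten  Corr.Rs
1   4.7 2.434437263
1   5.4 2.753744943
2   -8.0    0.027792795
3   5   2.265195289
", header = TRUE)
# ... (continue with the rest of the rows from the question)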

Here's how many rows 60% works out to within each group of the sample data you shared:

> table(mydf$plotID) * .6

   1    2    3 
 6.6 11.4  8.4 

Load "splitstackshape" and draw the sample:

> library(splitstackshape)
> out <- stratified(mydf, "plotID", .6, bothSets = TRUE)

The result is a list with two data.tables, one for the sample (60%) and one for what's left over (40%):

> str(out)
List of 2
 $ SAMP1:Classes ‘data.table’ and 'data.frame': 26 obs. of  3 variables:
  ..$ plotID : int [1:26] 1 1 1 1 1 1 1 2 2 2 ...
  ..$ Rs.ten : num [1:26] 2 4.4 3.6 3 4 0 4.7 5.9 6.5 6.4 ...
  ..$ Corr.Rs: num [1:26] 1.55 1.92 1.78 1.12 2.04 ...
  ..- attr(*, ".internal.selfref")=<externalptr> 
 $ SAMP2:Classes ‘data.table’ and 'data.frame': 18 obs. of  3 variables:
  ..$ plotID : int [1:18] 1 1 1 1 2 2 2 2 2 2 ...
  ..$ Rs.ten : num [1:18] 5.4 1.2 1.7 4.5 -8 3.5 5.2 5 0.4 3.1 ...
  ..$ Corr.Rs: num [1:18] 2.7537 1.8493 1.0755 1.4588 0.0278 ...
  ..- attr(*, "sorted")= chr "plotID"
  ..- attr(*, ".internal.selfref")=<externalptr> 
> lapply(out, function(x) table(x$plotID))
$SAMP1

 1  2  3 
 7 11  8 

$SAMP2

1 2 3 
4 8 6 
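
The two pieces can also be pulled straight out of that list by name (train and test here are just names I'm choosing):

train <- out$SAMP1   # the 60% sample drawn within each plotID
test  <- out$SAMP2   # the remaining 40% of each plotID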

It's generally more convenient to keep related data together in a list, but if you want separate objects, you can use list2env, like this:

Notice that I'm starting with just one object in my workspace:

ls()
# [1] "mydf"
list2env(stratified(mydf, "plotID", .6, bothSets = TRUE), envir = .GlobalEnv)
# <environment: R_GlobalEnv>

I now have three objects:

ls()
# [1] "mydf"  "SAMP1" "SAMP2"
head(SAMP1)
#    plotID Rs.ten  Corr.Rs
# 1:      1    2.0 1.553990
# 2:      1    1.7 1.075500
# 3:      1    4.5 1.458830
# 4:      1    3.6 1.779140
# 5:      1    4.0 2.044908
# 6:      1    5.4 2.753745
nrow(SAMP1)
# [1] 26
head(SAMP2)
#    plotID Rs.ten  Corr.Rs
# 1:      1    4.7 2.434437
# 2:      1    1.2 1.849290
# 3:      1    3.0 1.124850
# 4:      1    4.4 1.922450
# 5:      2    4.0 1.007497
# 6:      2    5.5 1.097064
nrow(SAMP2)
# [1] 18
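
As a quick sanity check, you could compare the per-group counts of the sample against the full data; each value should come out close to the requested 0.6:

round(table(SAMP1$plotID) / table(mydf$plotID), 2)
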
answered Nov 15 '22 by A5C1D2H2I1M1N2O1R2T1


What about this?

set.seed(123)  # make the random draw reproducible
# within each plotID, sample 60% of the row indices for the training set
ind_train <- lapply(split(seq_len(nrow(df)), df$plotID), function(x) sample(x, floor(.6 * length(x))))
# the remaining indices within each plotID form the test set
ind_test <- mapply(function(x, y) setdiff(x, y), x = split(seq_len(nrow(df)), df$plotID), y = ind_train, SIMPLIFY = FALSE)

Which gives you:

df[unlist(ind_test),]
   plotID Rs.ten    Corr.Rs
2       1    5.4 2.75374494
3       1    4.0 2.04490848
5       1    1.2 1.84929000
6       1    1.7 1.07550000
9       1    3.0 1.12485000
12      2   -8.0 0.02779279
15      2    4.0 1.00749680
16      2    5.6 1.73829377
17      2    6.5 1.72297476
23      2    5.0 1.42508238
27      2    5.9 0.81313000
29      2    6.3 1.25597000
30      2    4.7 1.03459000
32      3    5.3 1.65580173
33      3    4.4 3.59358761
34      3    4.0 3.66834805
39      3    2.8 1.34671000
41      3    5.6 1.56052000
44      3    3.7 2.91198000
df[unlist(ind_train),]
   plotID Rs.ten   Corr.Rs
4       1    0.0 1.1925100
8       1    4.5 1.4588300
11      1    3.6 1.7791400
10      1    4.4 1.9224500
7       1    2.0 1.5539900
1       1    4.7 2.4344373
22      2    6.6 1.8128551
28      2    4.9 0.8966800
21      2    6.4 1.3924907
19      2    5.5 1.0970636
26      2    2.7 0.7914000
18      2    6.4 1.5904818
20      2    5.2 1.3896836
25      2    3.1 1.0016200
13      2    0.2 0.9884438
24      2    0.4 0.9067800
14      2    3.5 0.9373114
31      3    5.0 2.2651953
35      3    5.2 2.4597420
42      3    4.2 2.2606700
40      3    2.6 1.9010500
37      3    0.7 2.5531600
36      3    4.3 3.1286876
38      3    3.0 2.5708000
43      3    4.7 2.2248800
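
If you then want the splits as their own data frames rather than index lists, you can subset once (train_df and test_df are just placeholder names):

train_df <- df[unlist(ind_train), ]   # 60% of the rows from each plotID
test_df  <- df[unlist(ind_test), ]    # the remaining 40%
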
answered Nov 15 '22 by DatamineR