The following is a sample of my dataset.
plotID Rs.ten Corr.Rs
1 4.7 2.434437263
1 5.4 2.753744943
1 4 2.044908476
1 0 1.19251
1 1.2 1.84929
1 1.7 1.0755
1 2 1.55399
1 4.5 1.45883
1 3 1.12485
1 4.4 1.92245
1 3.6 1.77914
2 -8.0 0.027792795
2 0.2 0.988443802
2 3.5 0.937311439
2 4 1.007496802
2 5.6 1.738293766
2 6.5 1.722974764
2 6.4 1.590481774
2 5.5 1.097063592
2 5.2 1.389683585
2 6.4 1.392490686
2 6.6 1.812855123
2 5 1.42508238
2 0.4 0.90678
2 3.1 1.00162
2 2.7 0.7914
2 5.9 0.81313
2 4.9 0.89668
2 6.3 1.25597
2 4.7 1.03459
3 5 2.265195289
3 5.3 1.655801734
3 4.4 3.593587609
3 4 3.668348047
3 5.2 2.459742028
3 4.3 3.128687638
3 0.7 2.55316
3 3 2.5708
3 2.8 1.34671
3 2.6 1.90105
3 5.6 1.56052
3 4.2 2.26067
3 4.7 2.22488
3 3.7 2.91198
I have 36 groups represented by plotID. I want to split the dataset into training and testing datasets (60/40, respectively) for each group (plotID).
In other words, I need a function that will randomly select 60% of the data from plotID 1, plotID 2, plotID 3, etc. for training and leave the remaining 40% from each plotID for testing. I came close using the following link: Randomly split data by criterion into training and testing data set using R. However, that simply split the entire dataset 60/40 by the total number of groups, not within each group.
Seems like I'm missing something simple here, but I just can't see it.
Thanks in advance for your help.
Data splitting is an important aspect of data science, particularly when building models from data. It helps ensure that the models, and the processes that rely on them (such as machine learning), are accurate.
The simplest way to split a modelling dataset into training and testing sets is to assign two-thirds of the data points to the former and the remaining one-third to the latter. We then train the model on the training set and apply it to the test set, which lets us evaluate how the model performs on data it has not seen.
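As a rough illustration of that simple (non-stratified) split, here is a minimal base R sketch. The data frame `dat`, its columns `x` and `y`, the seed, and the linear model are all made up for the example and are not part of the question's dataset.

```r
# Hypothetical example data: 30 rows with one predictor and one response.
set.seed(42)                      # arbitrary seed, only for reproducibility
dat <- data.frame(x = 1:30, y = rnorm(30))

# Draw two-thirds of the row indices for training; the rest form the test set.
n_train   <- round(2 / 3 * nrow(dat))
train_idx <- sample(seq_len(nrow(dat)), n_train)

train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Fit on the training rows, then evaluate on the held-out rows.
fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$y - pred)^2))
```

Note that this draws from the whole dataset at once, so it does not guarantee any particular share from each group.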
Stratified Split (Py) splits the data into two samples (train data and test data), with the additional option of specifying a column to stratify on.
You can use the stratified function from my "splitstackshape" package:
Here's what 60% of the sample data you shared would look like (in terms of the number of elements per group):
> table(mydf$plotID) * .6
1 2 3
6.6 11.4 8.4
Load "splitstackshape" and draw the sample:
> library(splitstackshape)
> out <- stratified(mydf, "plotID", .6, bothSets = TRUE)
The result is a list with two data.tables, one for the sample (60%) and one for what's left over (40%):
> str(out)
List of 2
$ SAMP1:Classes ‘data.table’ and 'data.frame': 26 obs. of 3 variables:
..$ plotID : int [1:26] 1 1 1 1 1 1 1 2 2 2 ...
..$ Rs.ten : num [1:26] 2 4.4 3.6 3 4 0 4.7 5.9 6.5 6.4 ...
..$ Corr.Rs: num [1:26] 1.55 1.92 1.78 1.12 2.04 ...
..- attr(*, ".internal.selfref")=<externalptr>
$ SAMP2:Classes ‘data.table’ and 'data.frame': 18 obs. of 3 variables:
..$ plotID : int [1:18] 1 1 1 1 2 2 2 2 2 2 ...
..$ Rs.ten : num [1:18] 5.4 1.2 1.7 4.5 -8 3.5 5.2 5 0.4 3.1 ...
..$ Corr.Rs: num [1:18] 2.7537 1.8493 1.0755 1.4588 0.0278 ...
..- attr(*, "sorted")= chr "plotID"
..- attr(*, ".internal.selfref")=<externalptr>
> lapply(out, function(x) table(x$plotID))
$SAMP1
1 2 3
7 11 8
$SAMP2
1 2 3
4 8 6
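Judging only from the counts printed above (not from the package documentation), the per-group sample sizes of 7, 11 and 8 are what rounding 6.6, 11.4 and 8.4 to the nearest integer gives. A small base R sketch of that arithmetic, with `group_sizes` typed in by hand from the sample data:

```r
# Group sizes of the sample data for plotID 1, 2 and 3 (11, 19 and 14 rows).
group_sizes <- c(`1` = 11, `2` = 19, `3` = 14)

target <- group_sizes * 0.6   # 6.6, 11.4, 8.4: not whole numbers

n_round <- round(target)      # nearest integer: 7, 11, 8
n_floor <- floor(target)      # always rounds down: 6, 11, 8

n_round
n_floor
group_sizes - n_round         # rows left over per group: 4, 8, 6
```

Because group sizes rarely divide evenly, the realised split is only approximately 60/40, and the rounding rule decides which set gets the extra row.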
It's generally more convenient to keep related data together in a list, but if you want separate objects, you can use list2env, like this:
Notice that I'm starting with just one object in my workspace:
ls()
# [1] "mydf"
list2env(stratified(mydf, "plotID", .6, bothSets = TRUE), envir = .GlobalEnv)
# <environment: R_GlobalEnv>
I now have three objects:
ls()
# [1] "mydf" "SAMP1" "SAMP2"
head(SAMP1)
# plotID Rs.ten Corr.Rs
# 1: 1 2.0 1.553990
# 2: 1 1.7 1.075500
# 3: 1 4.5 1.458830
# 4: 1 3.6 1.779140
# 5: 1 4.0 2.044908
# 6: 1 5.4 2.753745
nrow(SAMP1)
# [1] 26
head(SAMP2)
# plotID Rs.ten Corr.Rs
# 1: 1 4.7 2.434437
# 2: 1 1.2 1.849290
# 3: 1 3.0 1.124850
# 4: 1 4.4 1.922450
# 5: 2 4.0 1.007497
# 6: 2 5.5 1.097064
nrow(SAMP2)
# [1] 18
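To see what list2env does without touching the global workspace, here is a minimal sketch that unpacks a named list into a fresh environment. The list `parts` and its element names `train` and `test` are placeholders invented for the example.

```r
# A named list standing in for a two-element split result (names are illustrative).
parts <- list(train = data.frame(id = 1:3), test = data.frame(id = 4:5))

# Copy each element into a new environment instead of the global one,
# so nothing in the workspace gets overwritten by accident.
env <- new.env()
list2env(parts, envir = env)

ls(env)            # "test" "train"
nrow(env$train)    # 3
```

Each list element becomes an object named after its list name, which is why the objects above ended up being called SAMP1 and SAMP2.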
What about this?
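set.seed(123)  # make the random split reproducible
idx <- split(seq_len(nrow(df)), df$plotID)  # row indices grouped by plotID
ind_train <- lapply(idx, function(x) sample(x, floor(.6 * length(x))))  # 60% of each group, rounded down
ind_test <- mapply(setdiff, idx, ind_train, SIMPLIFY = FALSE)  # the remaining rows of each group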
Which gives you:
df[unlist(ind_test),]
plotID Rs.ten Corr.Rs
2 1 5.4 2.75374494
3 1 4.0 2.04490848
5 1 1.2 1.84929000
6 1 1.7 1.07550000
9 1 3.0 1.12485000
12 2 -8.0 0.02779279
15 2 4.0 1.00749680
16 2 5.6 1.73829377
17 2 6.5 1.72297476
23 2 5.0 1.42508238
27 2 5.9 0.81313000
29 2 6.3 1.25597000
30 2 4.7 1.03459000
32 3 5.3 1.65580173
33 3 4.4 3.59358761
34 3 4.0 3.66834805
39 3 2.8 1.34671000
41 3 5.6 1.56052000
44 3 3.7 2.91198000
df[unlist(ind_train),]
plotID Rs.ten Corr.Rs
4 1 0.0 1.1925100
8 1 4.5 1.4588300
11 1 3.6 1.7791400
10 1 4.4 1.9224500
7 1 2.0 1.5539900
1 1 4.7 2.4344373
22 2 6.6 1.8128551
28 2 4.9 0.8966800
21 2 6.4 1.3924907
19 2 5.5 1.0970636
26 2 2.7 0.7914000
18 2 6.4 1.5904818
20 2 5.2 1.3896836
25 2 3.1 1.0016200
13 2 0.2 0.9884438
24 2 0.4 0.9067800
14 2 3.5 0.9373114
31 3 5.0 2.2651953
35 3 5.2 2.4597420
42 3 4.2 2.2606700
40 3 2.6 1.9010500
37 3 0.7 2.5531600
36 3 4.3 3.1286876
38 3 3.0 2.5708000
43 3 4.7 2.2248800
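If you need this more than once, the same idea can be wrapped in a small function. This is only a sketch: `split_by_group`, its arguments, and the `toy` data frame are names made up here, and it rounds each group's training size down, as the code above does.

```r
# Hypothetical helper: split a data frame within each level of a grouping column.
split_by_group <- function(data, group_col, prop = 0.6) {
  idx <- split(seq_len(nrow(data)), data[[group_col]])
  train_idx <- unlist(lapply(idx, function(i) {
    # Index via sample.int() so a one-row group is not misread as 1:i by sample().
    i[sample.int(length(i), floor(prop * length(i)))]
  }), use.names = FALSE)
  test_idx <- setdiff(seq_len(nrow(data)), train_idx)
  list(train = data[train_idx, , drop = FALSE],
       test  = data[test_idx, , drop = FALSE])
}

# Toy data with the same group sizes as the sample (11, 19 and 14 rows).
set.seed(123)
toy <- data.frame(plotID = rep(1:3, times = c(11, 19, 14)), value = rnorm(44))

parts <- split_by_group(toy, "plotID")
table(parts$train$plotID)   # 6, 11 and 8 rows
table(parts$test$plotID)    # 5, 8 and 6 rows
```

Returning both pieces in one list keeps the training and testing rows together, and every row lands in exactly one of the two.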