
How to segment data into two sets using dplyr's setdiff

Tags:

r

I'm using dplyr to do a simple split of some data into training and test sets.

When I do a simple example, it works great:

library(dplyr)

a = c(1, 2, 3, 4, 5, 6, 7, 8)
b = c("A", "B", "C", "D", "E", "F", "G", "H")

df = data.frame(a, b)

train = sample_frac(df, 0.8)
test = setdiff(df, train)

> nrow(train) + nrow(test) == nrow(df)
[1] TRUE

However, when I try the same thing with the classic UCI Wine dataset, the row counts no longer add up:

wine = read.csv("http://www.nd.edu/~mclark19/learn/data/goodwine.csv")

wine_train = sample_frac(wine, 0.8)
wine_test = setdiff(wine, wine_train)

> nrow(wine_train) + nrow(wine_test) == nrow(wine)
[1] FALSE
> nrow(wine_train) + nrow(wine_test)
[1] 6105
> nrow(wine)
[1] 6497

Is there something about the behavior of setdiff that I'm missing?

Thanks, AG

dreww2 asked Mar 08 '26 15:03


1 Answer

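This is most likely because the data contain duplicated rows. dplyr's setdiff treats a data frame as a set of rows and returns only distinct rows, so duplicates that would land in the test set are collapsed, and every copy of a row that was sampled into wine_train is dropped. You can check for duplicates with: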

> any(duplicated(wine))
[1] TRUE
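
To see the effect in isolation, here is a minimal sketch on a made-up data frame (the names `x` and `y` and their values are hypothetical, not taken from the wine data). It assumes dplyr is installed and only illustrates that setdiff drops duplicate rows:

```r
library(dplyr)

# Hypothetical toy data: the row (1, "A") appears twice in x.
x <- data.frame(a = c(1, 1, 2, 3),
                b = c("A", "A", "B", "C"),
                stringsAsFactors = FALSE)
y <- data.frame(a = 3, b = "C", stringsAsFactors = FALSE)

# Only one row of x matches y, so one might expect 3 rows back.
# setdiff works on distinct rows, so the duplicate is collapsed.
diff_rows <- setdiff(x, y)
nrow(diff_rows)  # 2, not 3
```

The same thing happens in the wine split: any duplicated row shrinks the result of setdiff, so the two pieces no longer sum to the original row count.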

If you remove the duplicated rows first, the counts add up:

drunk = wine[!duplicated(wine),]
drunk_train = sample_frac(drunk, 0.8)
drunk_test = setdiff(drunk, drunk_train)
> nrow(drunk_test) + nrow(drunk_train) == nrow(drunk)
[1] TRUE
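
If you would rather keep the duplicated rows, one possible alternative (not part of the original answer) is to tag each row with a unique id and split on that id instead of on the row contents. The sketch below uses a made-up data frame; `df_id` and `row_id` are hypothetical names introduced for illustration, and the seed is arbitrary:

```r
library(dplyr)

# Hypothetical toy data with duplicated rows (stand-in for the wine data).
df <- data.frame(a = c(1, 1, 2, 2, 3, 4, 5, 6, 7, 8),
                 b = c("A", "A", "B", "B", "C", "D", "E", "F", "G", "H"),
                 stringsAsFactors = FALSE)

set.seed(1)  # arbitrary seed, only so the split is reproducible

# Give every row a unique id so identical rows stay distinguishable.
df_id <- df %>% mutate(row_id = row_number())

# Sample 80% for training, then keep every row whose id was not sampled.
train <- df_id %>% sample_frac(0.8)
test  <- df_id %>% anti_join(train, by = "row_id")

nrow(train) + nrow(test) == nrow(df)  # TRUE
```

Because anti_join matches on `row_id` only and does not deduplicate, each original row ends up in exactly one of the two sets. Drop the `row_id` column afterwards if you do not want it in the model data.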
Colonel Beauvel answered Mar 10 '26 07:03