
R - sample used in %in% modify dataframe which is being subsetted

Tags: r, subset, sample

I'm not sure I titled the question correctly, because I don't fully understand the reason for the following behaviour:

dfSet <- data.frame(ID = sample(1:15, size = 15, replace = FALSE), va1 = NA, va3 = 0, stringsAsFactors = FALSE)

dfSet[1:10, ]$va1 <- 'o1'
dfSet[11:15, ]$va1 <- 'o2'

dfSet[dfSet$ID %in% sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE), ]$va3 <- 1

print(length(unique(dfSet$ID)))

I expect the final print to show 15, but it doesn't. Instead it shows 13 or 14, and dfSet has been modified so that at least two rows share the same ID. It seems that this part of the code:

dfSet[dfSet$ID %in% sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE), ]$va3 <- 1

modifies the $ID column, and I don't know why.

Workaround:

temp <- sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE)
dfSet[dfSet$ID %in% temp, ]$va3 <- 1

In this case everything works as expected: there are 15 rows with unique IDs.

The question is: why does using sample directly inside %in% modify the data frame?

asked Aug 03 '16 19:08 by her_dom


1 Answer

The problem seems to be that R does something tricky when you assign to the return value of a function. For example, something like

a <- c(1,3)
names(a) <- c("one", "three")

would look very odd in most languages. How do you assign a value to the return value of a function? What's really happening is that there is a function named names<- defined. It returns a transformed copy of the original object, which is then used to replace the variable that was passed to the function. So it really looks like this

.temp. <- `names<-`(a, c("one","three"))
a <- .temp.

The variable a is always completely replaced, not just its names.
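As a quick check of the equivalence described above, here is a minimal sketch (the variable b is introduced only for illustration) that applies the usual syntax to one vector and calls the replacement function explicitly on a copy:

```r
a <- c(1, 3)
b <- a                                   # copy used for the explicit form

names(a) <- c("one", "three")            # usual replacement syntax
b <- `names<-`(b, c("one", "three"))     # explicit call, then rebinding b

identical(a, b)                          # both forms give the same object
```

Note that calling `names<-` on its own does nothing to b until the result is assigned back, which is the "completely replaced" step.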

When you do something like

dfSet$a<-1

what's really happening again is

.temp. <- "$<-"(dfSet, a, 1)
dfSet <- .temp.

Now things get a bit trickier when you combine [] and $ subsetting. Look at this example

# helper for subsetting: prints a message every time it is evaluated
f <- function(x,v) {print("testing"); x==v}
x <- rep(0:1, length.out=nrow(dfSet))
dfSet$a <- 0

dfSet[f(x,1),]$a<-1

Notice how "testing" is printed twice. What's going on is really more like

.temp1. <- "$<-"(dfSet[f(x,1),], a, 1)
.temp2. <- "[<-"(dfSet, f(x,1), , .temp1.)
dfSet <- .temp2.

So f(x,1) is evaluated twice. This means that sample would be evaluated twice as well, and each evaluation can draw a different random set of IDs.
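The double evaluation can also be counted rather than printed. The following is only a sketch: the names calls, idx and df are made up for this illustration and are not part of the question's code.

```r
calls <- 0

# Hypothetical index function that records how often it is evaluated
idx <- function() {
  calls <<- calls + 1
  c(TRUE, FALSE, TRUE)
}

df <- data.frame(ID = 1:3, v = 0)

# Row subsetting followed by $ assignment: the row index expression
# is used once to extract the rows and once more to write them back
df[idx(), ]$v <- 1

calls   # number of times idx() was evaluated by the single assignment
```

Because idx() here is deterministic, the data frame still ends up correct; the trouble only shows when the index expression can return something different on the second evaluation, as sample does.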

The error is a bit more obvious if you try to replace a variable that does not exist yet

dfSet[f(x,1),]$b<-1
# Warning message:
# In `[<-.data.frame`(`*tmp*`, f(x, 1), , value = list(ID = c(6L,  :
#  provided 4 variables to replace 3 variables

Here you get the warning because the .temp1. variable has added the column and now has 4 columns, but when you try to do the assignment to .temp2. the slice of the data frame that you are trying to replace has a different size (3 columns).

The IDs are replaced because the $<- operator doesn't just return a new column, it returns a new data.frame with the column updated to whatever value you assigned. This means that the rows that were updated are returned along with the ID that was there when the assignment happened. This is saved in the .temp1. variable. Then when you do the [<- assignment, you are choosing a new set of rows to swap out. The values of all columns of these rows are replaced with the values from .temp1.. This means that you will be overwriting the IDs for the replacement rows and they may differ so you are likely to wind up with two or more copies of a given ID.
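Besides storing the sample in a temporary variable, one way to sidestep the problem is to select the column first and index it by row afterwards, so that the expression containing sample appears only once in the expanded assignment. This is a sketch under that assumption, not the only possible fix; the seed value is arbitrary and only there to make the run reproducible.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

dfSet <- data.frame(ID  = sample(1:15, size = 15, replace = FALSE),
                    va1 = rep(c("o1", "o2"), times = c(10, 5)),
                    va3 = 0,
                    stringsAsFactors = FALSE)

# Column first, then rows: only the va3 vector is modified and written
# back, and the sample() call is evaluated a single time.
dfSet$va3[dfSet$ID %in% sample(dfSet$ID[dfSet$va1 == "o1"], 7)] <- 1

length(unique(dfSet$ID))  # IDs are left untouched
sum(dfSet$va3)            # number of rows flagged
```

Since whole rows are never copied back into the data frame in this form, the ID column cannot be overwritten.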

answered Sep 20 '22 01:09 by MrFlick