R - sample used in %in% modifies the data frame being subsetted

Tags: r, subset, sample

I'm not sure I titled the question correctly, because I don't fully understand the reason for the following behaviour:

dfSet <- data.frame(ID = sample(1:15, size = 15, replace = FALSE), va1 = NA, va3 = 0, stringsAsFactors = FALSE)

dfSet[1:10, ]$va1 <- 'o1'
dfSet[11:15, ]$va1 <- 'o2'

dfSet[dfSet$ID %in% sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE), ]$va3 <- 1

print(length(unique(dfSet$ID)))

I expect the final print to show 15, but it doesn't. Instead it shows 13 or 14, and dfSet has been modified so that at least two rows share the same ID. It seems that this part of the code:

dfSet[dfSet$ID %in% sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE), ]$va3 <- 1

modifies the $ID column, and I don't know why.

Workaround:

temp <- sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE)
dfSet[dfSet$ID %in% temp, ]$va3 <- 1

In this case everything works as expected: there are 15 rows with unique IDs.

The question is: why does using sample directly inside %in% modify the data frame?

asked Aug 03 '16 19:08

her_dom

1 Answer

The problem seems to be that R does something tricky when you assign to the return value of a function. For example, something like

a <- c(1,3)
names(a) <- c("one", "three")

would look very odd in most languages. How do you assign a value to the return value of a function? What's really happening is that a function named names<- is defined. It returns a transformed copy of the original object, which then replaces the variable that was passed to the function. So it really looks like this

.temp. <- `names<-`(a, c("one","three"))
a <- .temp.

The variable a is always completely replaced, not just its names.
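
As a minimal sketch of the same mechanism, here is a user-defined replacement function. The name second<- is hypothetical and exists only for this illustration; the point is that the assignment form and the explicit call produce the same result, because in both cases the whole variable is rebound to the function's return value.

```r
# Hypothetical replacement function, defined only for illustration:
# it returns a modified copy of x with the second element replaced.
`second<-` <- function(x, value) {
  x[2] <- value
  x
}

a <- c(1, 3)
second(a) <- 10                  # sugar for: a <- `second<-`(a, value = 10)

b <- c(1, 3)
b <- `second<-`(b, value = 10)   # the explicit form

identical(a, b)                  # TRUE
```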

When you do something like

dfSet$a<-1

what's really happening again is

.temp. <- "$<-"(dfSet, a, 1)
dfSet <- .temp.

Now things get a bit more tricky when you combine [] and $ subsetting in the same assignment. Look at this example

#for subsetting
f <- function(x,v) {print("testing"); x==v}
x <- rep(0:1, length.out=nrow(dfSet))
dfSet$a <- 0

dfSet[f(x,1),]$a<-1

Notice how "testing" is printed twice. What's going on is really more like

.temp1. <- "$<-"(dfSet[f(x,1),], a, 1)
.temp2. <- "[<-"(dfSet, f(x,1), , .temp1.)
dfSet <- .temp2.

So f(x,1) is evaluated twice. This means that sample would be evaluated twice as well, and each call can return a different random draw.
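
To see the double evaluation with sample itself, the following sketch records every draw made by the index expression. The names df, draws and pick are assumptions made up for this example, not part of the original question.

```r
set.seed(42)
df <- data.frame(ID = 1:10, a = 0)

# Hypothetical helper: draws 3 IDs, remembers the draw, and returns
# the logical row index that the draw selects.
draws <- list()
pick <- function() {
  s <- sample(df$ID, 3)
  draws[[length(draws) + 1]] <<- s
  df$ID %in% s
}

df[pick(), ]$a <- 1

length(draws)   # 2: the index expression was evaluated twice
draws           # inspect the two draws; they need not be the same
```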

The error is a bit more obvious if you try to replace a variable that does not exist yet

dfSet[f(x,1),]$b<-1
# Warning message:
# In `[<-.data.frame`(`*tmp*`, f(x, 1), , value = list(ID = c(6L,  :
#  provided 4 variables to replace 3 variables

Here you get the warning because .temp1. has had the column added and now has 4 columns, but when you do the assignment that produces .temp2., the slice of the data frame you are trying to replace still has only 3 columns, so the sizes differ.

The IDs are replaced because the $<- operator doesn't just return a new column; it returns a new data.frame with the column updated to whatever value you assigned. This means the updated rows are returned along with the IDs they had when the assignment happened, and this is saved in .temp1.. Then, when the [<- assignment runs, a new set of rows is chosen to be swapped out, and the values of all columns in those rows are replaced with the values from .temp1.. The IDs of the replacement rows are therefore overwritten, and since the two sets of rows may differ, you are likely to wind up with two or more copies of a given ID.
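
A small deterministic sketch of this overwrite, with no randomness involved. The helper idx is hypothetical: it returns a different set of rows on each call, standing in for two different sample() draws. The second half shows one possible alternative to the temp-variable workaround, df$col[rows] <- value, where the row index appears only once in the expansion and so is evaluated only once.

```r
df <- data.frame(ID = 1:4, v = 0)

# Hypothetical index function: rows 1:2 on the first call, rows 3:4 on
# every later call, imitating two sample() draws that disagree.
n <- 0
idx <- function() {
  n <<- n + 1
  if (n == 1) 1:2 else 3:4
}

df[idx(), ]$v <- 1
n                        # 2: idx() ran twice
length(unique(df$ID))    # 2 distinct IDs instead of 4

# Alternative form: the row index is used only by `[<-` on the column,
# so idx() runs once and the ID column is never touched.
df2 <- data.frame(ID = 1:4, v = 0)
n <- 0
df2$v[idx()] <- 1
n                        # 1
length(unique(df2$ID))   # 4
```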

answered Sep 20 '22 01:09

MrFlick
