I have a large data frame (df) with binary values that are either 1 or 2. The data also contain NAs. As a practical example, here is a short vector containing a subset of one user's data:
df <- c(NA,NA,2,1,1,1,2,1,2,2,1,1,1,NA,2,2,1,2,1,1,1,2)
What I would like is a function that finds the first and the second 2s of an array and sets everything within this interval to 2. However, if the difference between the positions of the second and the first 2 is > 3, the values stay as they are and no change is made.
In addition, the function has to repeat this for each value of df. For example, considering again the case of:
df <- c(NA,NA,2,1,1,1,2,1,2,2,1,1,1,NA,2,2,1,2,1,1,1,2)
The function should have this outcome:
df_outcome <- c(NA,NA,2,1,1,1,2,2,2,2,1,1,1,NA,2,2,2,2,1,1,1,2)
Notice that in df_outcome the values between the very first and second 2s were not merged, as the difference in their positions was > 3. On the other hand, other non-2 values were changed accordingly.
What I have tried to do (but it does not work properly):
With the help of rollapply in the zoo package, I have tried to create a function that finds the first and the second 2 of the array and performs the modifications described above.
func <- function(q) {
  for (i in (which(q %in% 2)[1]):(which(q %in% 2)[2])) {
    q[i] <- 2
  }
  return(q)
}
Then I nested it using rollapply so I can specify a width for each cycle plus other arguments such as the position of the result index (left).
df_outcome<-rollapply(df, width = 3, FUN = func, fill = NA, partial = TRUE, align = "left")
The problem is that the user-defined function works when applied to a vector. When passed as an argument to rollapply, however, it returns an error:
Error in (which(q %in% 2)[1]):(which(q %in% 2)[2]) : NA/NaN argument
Called from: FUN(data[replace(posns, !ix, 0)], ...)
I guess there is some mistake on my part in the use of rollapply, or perhaps in the format of the data, but I cannot work out what the issue is. I thought about using rollapply because my data is very long and is generated for different users. Hence, I would need a function that can also split the data by other variables such as User_ID (much like the .variables argument in ddply or by in data.table).
Looking forward to your support.
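One likely cause of the error, sketched here without zoo: rollapply hands func one window of 3 values at a time, and a window may contain fewer than two 2s. In that case which(q %in% 2)[2] is NA, and the `:` operator fails. The snippet below reproduces this on the first window of df; the names window1, idx and msg are only illustrative.

```r
# func as defined in the question
func <- function(q) {
  for (i in (which(q %in% 2)[1]):(which(q %in% 2)[2])) {
    q[i] <- 2
  }
  return(q)
}

# First window of width 3 taken from df: it contains only one 2
window1 <- c(NA, NA, 2)

idx <- which(window1 %in% 2)  # position of the single 2
idx[2]                        # NA, because there is no second 2

# Capture the error message instead of stopping
msg <- tryCatch(func(window1), error = function(e) conditionMessage(e))
msg
```

On the full vector there are always at least two 2s, which would explain why func works there but not inside the rolling window.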
A solution with rle:
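# run-length encode df: consecutive equal values become (value, length) pairs
rldf <- rle(df)
# for each run, the value of the following run and of the preceding run
rlnext <- c(tail(rldf$values,-1), NA)
rlprev <- c(NA, head(rldf$values,-1))
# runs of fewer than three 1s that have a 2 on both sides become 2s
rldf$values[which(rldf$values == 1 & rlnext == 2 & rlprev == 2 & rldf$lengths < 3)] <- 2
df_out <- inverse.rle(rldf)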
which gives:
> df_out
[1] NA NA 2 1 1 1 2 2 2 2 1 1 1 NA 2 2 2 2 1 1 1 2
> identical(df_outcome,df_out)
[1] TRUE
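Since the question also asks for a per-user split, here is a minimal sketch of how the rle idea might be wrapped in a function and applied by group with base R's ave. The function name fill_short_gaps, the max_gap parameter, and the example data frame dat are all made up for illustration; only User_ID comes from the question.

```r
# Hypothetical helper wrapping the rle approach; max_gap is an assumed name
fill_short_gaps <- function(x, max_gap = 3) {
  r <- rle(x)
  nxt <- c(tail(r$values, -1), NA)   # value of the following run
  prv <- c(NA, head(r$values, -1))   # value of the preceding run
  # runs of 1s shorter than max_gap with a 2 on both sides
  hit <- which(r$values == 1 & nxt == 2 & prv == 2 & r$lengths < max_gap)
  r$values[hit] <- 2
  inverse.rle(r)
}

# Made-up example data: two users, six observations each
dat <- data.frame(
  User_ID = rep(c("a", "b"), each = 6),
  value   = c(2, 1, 2, 1, 1, 2,   1, 2, 1, 1, 1, 2)
)

# ave applies the function separately within each User_ID
dat$filled <- ave(dat$value, dat$User_ID, FUN = fill_short_gaps)
dat$filled
# [1] 2 2 2 2 2 2 1 2 1 1 1 2
```

User "a" has only short gaps, so everything is filled; user "b" has 2s four positions apart, so nothing changes. Because each user is processed separately, a 2 at the end of one user's data cannot be merged with a 2 at the start of the next.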
You can try to get the indices of the 2s in df. Then take the differences between those positions and use them to find the indices of the values to replace by 2:
# position of the 2s
pos_df_2 <- which(df==2)
# which 2s are at most 3 positions after the previous 2
wh_pos2_inf3 <- which(c(FALSE, diff(pos_df_2)<=3))
# get all indices between pairs of 2s that are at most 3 positions apart
ind_to_replace <- unique(unlist(sapply(wh_pos2_inf3, function(x) {pos_df_2[x-1]:pos_df_2[x]})))
# replace the elements by 2
df[ind_to_replace] <- 2
df
#[1] NA NA 2 1 1 1 2 2 2 2 1 1 1 NA 2 2 2 2 1 1 1 2
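The same steps can be collected into a reusable function, which may be easier to apply per user. This is only a sketch: the name fill_between_twos and the max_gap parameter are assumptions, not part of the original answer.

```r
# Hypothetical wrapper around the position-based approach
fill_between_twos <- function(x, max_gap = 3) {
  pos <- which(x == 2)                  # positions of the 2s (NAs are dropped)
  adj <- which(diff(pos) <= max_gap)    # k such that pos[k] and pos[k + 1] are close
  # all indices from each close pair's first 2 to its second 2
  idx <- unlist(lapply(adj, function(k) pos[k]:pos[k + 1]))
  x[idx] <- 2
  x
}

df <- c(NA,NA,2,1,1,1,2,1,2,2,1,1,1,NA,2,2,1,2,1,1,1,2)
fill_between_twos(df)
# [1] NA NA  2  1  1  1  2  2  2  2  1  1  1 NA  2  2  2  2  1  1  1  2

# Unlike a rule that only touches 1s, this also overwrites an NA
# that sits between two close 2s
fill_between_twos(c(2, NA, 2))
# [1] 2 2 2
```

Whether an NA between two close 2s should become a 2 is not specified in the question, so it is worth checking which behaviour is wanted before choosing between this and a run-length approach.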