 

Find similar elements of a vector and modify everything in between

I have a large data frame with a column of binary values coded 1 or 2; NAs are also present. As a practical example, I will create a short vector (df) containing a subset of one user's data:

df <- c(NA,NA,2,1,1,1,2,1,2,2,1,1,1,NA,2,2,1,2,1,1,1,2)

What I would like is a function that finds each pair of consecutive 2s in the vector and sets everything between them to 2. However, if the positions of the two 2s differ by more than 3, the values in between stay as they are and no change is made.

In addition, the function has to work along the whole of df, not only on the first pair of 2s. For example, considering again the case of:

df <- c(NA,NA,2,1,1,1,2,1,2,2,1,1,1,NA,2,2,1,2,1,1,1,2)

The function should have this outcome:

df_outcome <- c(NA,NA,2,1,1,1,2,2,2,2,1,1,1,NA,2,2,2,2,1,1,1,2)

Notice that in df_outcome the values between the very first and second 2s were not changed, because their positions differ by more than 3. The other short runs of non-2 values lying between two 2s were changed accordingly.

What I have tried to do (but it does not work properly):

With the help of rollapply in the zoo package, I have tried to create a function that finds the first and the second 2 of the array and performs the modifications as described above.

func <- function (q) {
  for (i in (which(q %in% 2)[1]):(which(q %in% 2)[2])) {
    q[i]<-2
  }
  return(q)
}

Then I passed it to rollapply so that I can specify a width for each window plus other arguments, such as the alignment of the result index (left).

df_outcome<-rollapply(df, width = 3, FUN = func, fill = NA, partial = TRUE, align = "left")

The problem is that the user-defined function works when applied directly to a vector. When passed as an argument to rollapply, however, it returns an error:

Error in (which(q %in% 2)[1]):(which(q %in% 2)[2]) : NA/NaN argument
Called from: FUN(data[replace(posns, !ix, 0)], ...)

I guess there is some mistake on my part in the use of rollapply, or perhaps in the format of the data, but I cannot work out what the issue is. I thought about using rollapply because my data is very long and is generated for different users. Hence, I would need a function that can also split the data by other variables such as User_ID (much like the .variables argument in ddply or by in data.table).

Looking forward to your help.

Asked by IlBardo, Dec 05 '22 16:12



2 Answers

A solution with rle:

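# run-length encoding: one entry per run of identical values (each NA is its own run)
rldf <- rle(df)
# value of the run that follows / precedes each run
rl_next <- c(tail(rldf$values, -1), NA)
rl_prev <- c(NA, head(rldf$values, -1))

# runs of 1s shorter than 3 that sit directly between two runs of 2s become 2
rldf$values[which(rldf$values == 1 & rl_next == 2 & rl_prev == 2 & rldf$lengths < 3)] <- 2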

df_out <- inverse.rle(rldf)

which gives:

> df_out
 [1] NA NA  2  1  1  1  2  2  2  2  1  1  1 NA  2  2  2  2  1  1  1  2

> identical(df_outcome,df_out)
[1] TRUE
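
The question also asks for this to run separately per User_ID. A minimal sketch of one way to do that in base R, assuming the data sit in a data frame with columns named User_ID and value: the function name fill_short_gaps, the max_len parameter, the dat data frame and the split of the example vector into two users are all hypothetical, introduced only for illustration.

```r
# Hypothetical helper wrapping the rle() approach above.
# max_len is the longest run of 1s (between two runs of 2s) that gets filled.
fill_short_gaps <- function(x, max_len = 2) {
  r <- rle(x)
  n <- length(r$values)
  if (n < 3) return(x)                      # fewer than 3 runs: nothing can be enclosed
  prev_val <- c(NA, r$values[-n])           # value of the preceding run
  next_val <- c(r$values[-1], NA)           # value of the following run
  fill <- which(r$values == 1 & prev_val == 2 & next_val == 2 &
                  r$lengths <= max_len)     # which() drops the NA comparisons
  r$values[fill] <- 2
  inverse.rle(r)
}

x <- c(NA,NA,2,1,1,1,2,1,2,2,1,1,1,NA,2,2,1,2,1,1,1,2)

# Made-up grouping: first 11 values belong to user "a", the rest to user "b"
dat <- data.frame(User_ID = rep(c("a", "b"), each = 11), value = x)

# ave() applies the function within each User_ID and keeps the original row order
dat$filled <- ave(dat$value, dat$User_ID, FUN = fill_short_gaps)
```

Because each user's values are processed separately, a run of 1s that straddles the boundary between two users is never filled. The same function could probably be used inside a data.table `by` or a dplyr `group_by`, but that is not shown here.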
Answered by Jaap, Jan 19 '23 10:01



You can get the indices of the 2s in df, take the differences between consecutive positions, and from those find the indices of the values to replace by 2:

# position of the 2s
pos_df_2 <- which(df==2) 
# which 2s lie at most 3 positions after the previous 2
wh_pos2_inf3 <- which(c(FALSE, diff(pos_df_2)<=3))
# get all indices between pairs of 2s that are at most 3 positions apart
# (lapply always returns a list, so unlist() gives a plain vector)
ind_to_replace <- unique(unlist(lapply(wh_pos2_inf3, function(x) {pos_df_2[x-1]:pos_df_2[x]})))
# replace the elements by 2
df[ind_to_replace] <- 2 
df
#[1] NA NA  2  1  1  1  2  2  2  2  1  1  1 NA  2  2  2  2  1  1  1  2
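
As a rough sketch, the position-based steps above can be wrapped in a function so that they are reusable; the name fill_between_twos and the max_dist parameter are hypothetical. The second call illustrates one difference from the rle approach that seems worth knowing: here everything between two close 2s is overwritten, including NAs, whereas the rle version only changes runs of 1s.

```r
# Hypothetical wrapper around the position-based approach.
# max_dist is the largest allowed distance between two consecutive 2s.
fill_between_twos <- function(x, max_dist = 3) {
  pos <- which(x == 2)                          # positions of the 2s (NAs are skipped)
  close <- which(diff(pos) <= max_dist)         # pairs of consecutive 2s that are close enough
  idx <- unlist(lapply(close, function(i) pos[i]:pos[i + 1]))
  x[idx] <- 2                                   # no-op when idx is NULL
  x
}

x <- c(NA,NA,2,1,1,1,2,1,2,2,1,1,1,NA,2,2,1,2,1,1,1,2)
fill_between_twos(x)

# An NA enclosed by two close 2s is overwritten by this approach
fill_between_twos(c(2, NA, 2))
```

Which behaviour is preferable depends on whether an NA between two 2s should count as part of the interval; the question does not say.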
Answered by Cath, Jan 19 '23 11:01
