Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to remove any co-occurrence of sub-list elements from vector (R)

Tags:

r

I review the python question How to remove every occurrence of sub-list from list. Now I want to know how many creative ways are there in R.
For example, removing any occurrences of sub_list from the main_list.

main_list = c(2, 1, 2, 3, 1, 2, 4, 2, 2 ,1)
sub_list = c(1,2)

desired result: 2 3 4 2 2 1

My suggestions:

a<-c()
for(i in 1:(length(main_list)-1)){
if (all(main_list[c(i,i+1)]==sub_list))
{a<-c(a,c(i,i+1))}
}
main_list[-a]
[1] 2 3 4 2 2 1

2

as.numeric(unlist(strsplit(gsub("(12)","",paste0(main_list,collapse = "")),split = "")))

Ohh it is really dangerous. Let's try:

main_list = c(2, 1, 2, 3, 12, 1, 2, 4, 2, 2, 1)
as.numeric(unlist(strsplit(gsub("(12)","",paste0(main_list,collapse = "")),split = "")))
[1] 2 3 4 2 2 1
####However 
a<-c()
for(i in 1:(length(main_list)-1)){
if (all(main_list[c(i,i+1)]==sub_list))
{a<-c(a,c(i,i+1))}
}
main_list[-a]
[1]  2  3 12  4  2  2  1

Update Sat Sep 08 2018

Benchmarking Solutions:

I Benchmarked solutions base on the memory and time, each solution takes, with a big vector of numbers and used profmem and microbenchmark libraries.

set.seed(1587)
main_list<-sample(c(8:13,102:105),size = 10000000,replace = T)
main_list<-c(c(8,9,12,103),main_list,c(8,9,12,103))   
sub_list<-c(8,9,12,103)

d.b's solution does not work for main_list so I modified it as follows:

ML = paste(main_list, collapse = ",")  # collapse should not be empty
SL = paste(sub_list, collapse = ",")
out<-gsub(SL, "", ML)
out<-gsub("^\\,","",out)
out<-gsub("\\,$","",out)
out<-gsub("\\,,","\\,",out)
out<-as.numeric(unlist(strsplit(out,split = ",")))  
The result:
  solution       seconds memory_byte memory_base seconds_base
  <chr>            <dbl>       <dbl>       <dbl>        <dbl>
1 d.b              26.0    399904560        1           16.8 
2 Grothendieck_2    1.55  1440070304        3.60         1   
3 Grothendieck_1  109.    4968036376       12.4         70.3 
4 李哲源            2.17  1400120824        3.50         1.40

Any comment about the benchmarking?

like image 453
Iman Avatar asked Jan 27 '23 15:01

Iman


2 Answers

Here are two solutions. The first one is obviously simpler and would be used if you favour clarity and maintainability while the second one has no package dependencies and is faster.

1) zoo Use a moving window to compare each subsequence of c(main_list, sub_list) having the required length to the sub_list. (We append sub_list to ensure that there is always something to remove.) This statements returns TRUE or FALSE according to whether the current position is the end of a matching subsequence. Then compute the TRUE index numbers and from that the indices of all elements to be removed and remove them.

library(zoo)

w <- length(sub_list)
r <- rollapplyr(c(main_list, sub_list), w, identical, sub_list, fill = FALSE)
main_list[-c(outer(which(r), seq_len(w) - 1, "-"))]
## [1] 2 3 4 2 2 1

2) Base R. The middle line setting r has the same purpose as the corresponding line in (1) and the last line is the same as the last line in (2) except we use + instead of - due to the fact that embed effectively uses left alignment.

w <- length(sub_list)
r <- colSums(t(embed(c(main_list, sub_list), w)) == rev(sub_list)) == w
main_list[-c(outer(which(r), seq_len(w) - 1, "+"))]
## [1] 2 3 4 2 2 1
like image 189
G. Grothendieck Avatar answered Feb 01 '23 22:02

G. Grothendieck


Here is a function that does this general thing.

  • xm is a main list of integer / character / logical values;
  • xs is a sub list of integer /character / logical values.

It is required that length(xm) > length(xs) but no such check is made right now.


foo <- function (xm, xs) {
  nm <- length(xm)
  ns <- length(xs)
  shift_ind <- outer(0:(ns - 1), 1:(nm - ns + 1), "+")
  d <- xm[shift_ind] == xs
  first_drop_ind <- which(.colSums(d, ns, length(d) / ns) == ns)
  if (length(first_drop_ind) > 0L) {
    drop_ind <- outer(0:(ns - 1), first_drop_ind, "+")
    return(xm[-drop_ind])
    } else {
    return(xm)
    }
  }

main_list = c(2, 1, 2, 3, 1, 2, 4, 2, 2 ,1)
sub_list = c(1,2)
foo(main_list, sub_list)
#[1] 2 3 4 2 2 1

Explanation

xm <- main_list
xs <- sub_list

nm <- length(xm)
ns <- length(xs)
shift_ind <- outer(0:(ns - 1), 1:(nm - ns + 1), "+")
MAT <- matrix(xm[shift_ind], ns)
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#[1,]    2    1    2    3    1    2    4    2    2
#[2,]    1    2    3    1    2    4    2    2    1

So the first step is a shifting and matrix representation, as above.

LOGIC <- MAT == xs
#      [,1] [,2]  [,3]  [,4] [,5]  [,6]  [,7]  [,8]  [,9]
#[1,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
#[2,] FALSE TRUE FALSE FALSE TRUE FALSE  TRUE  TRUE FALSE

If a co-occurrence is found, a column should contain all TRUE, i.e., the colSums should be ns. In this way we can identify the location of the first value of the matching.

first_drop_ind <- which(colSums(LOGIC) == ns)
#[1] 2 5

Now we need to expand it to cover the subsequent values after those initial matches.

drop_ind <- outer(0:(ns - 1), first_drop_ind, "+")
#     [,1] [,2]
#[1,]    2    5
#[2,]    3    6

Finally we remove values at those positions from xm:

xm[-drop_ind]
#[1] 2 3 4 2 2 1

Note that in the function, the matrix is not explicitly formed. .colSums is used instead of colSums.


watch out for bug

The if ... else ... in the function is necessary. If no match is found then drop_ind would be integer(0), and using xm[-drop_ind] gives xm[integer(0)] which is integer(0).


comparison with zoo::rollapplyr

## require package `zoo`
bar <- function (xm, xs) {
  w <- length(xs)
  r <- rollapplyr(xm, w, identical, xs, fill = FALSE)
  if (length(r) > 0L) {
    return(xm[-c(outer(which(r), seq_len(w) - 1, "-"))])
    } else {
    return(xm)
    }
  }

set.seed(0)
xm <- sample.int(10, 10000, TRUE)
xs <- 1:2

library(zoo)

system.time(a <- foo(xm, xs))
#   user  system elapsed 
#  0.004   0.000   0.001 

system.time(b <- bar(xm, xs))
#   user  system elapsed 
#  0.276   0.000   0.273 

all.equal(a, b)
#[1] TRUE

I guess that rollapplyr is slower is because

  • it needs to first coerce xm to a "zoo" object;
  • internally it uses lapply so that there is a frequent jump between R and C.
like image 41
Zheyuan Li Avatar answered Feb 01 '23 22:02

Zheyuan Li