I review the python question How to remove every occurrence of sub-list from list.
Now I want to know how many creative ways are there in R.
For example, removing any occurrences of sub_list
from the main_list
.
main_list = c(2, 1, 2, 3, 1, 2, 4, 2, 2 ,1)
sub_list = c(1,2)
desired result: 2 3 4 2 2 1
My suggestions:
a<-c()
for(i in 1:(length(main_list)-1)){
if (all(main_list[c(i,i+1)]==sub_list))
{a<-c(a,c(i,i+1))}
}
main_list[-a]
[1] 2 3 4 2 2 1
2
as.numeric(unlist(strsplit(gsub("(12)","",paste0(main_list,collapse = "")),split = "")))
Ohh it is really dangerous. Let's try:
main_list = c(2, 1, 2, 3, 12, 1, 2, 4, 2, 2, 1)
as.numeric(unlist(strsplit(gsub("(12)","",paste0(main_list,collapse = "")),split = "")))
[1] 2 3 4 2 2 1
####However
a<-c()
for(i in 1:(length(main_list)-1)){
if (all(main_list[c(i,i+1)]==sub_list))
{a<-c(a,c(i,i+1))}
}
main_list[-a]
[1] 2 3 12 4 2 2 1
I Benchmarked solutions base on the memory and time, each solution takes, with a big vector of numbers and used profmem
and microbenchmark
libraries.
set.seed(1587)
main_list<-sample(c(8:13,102:105),size = 10000000,replace = T)
main_list<-c(c(8,9,12,103),main_list,c(8,9,12,103))
sub_list<-c(8,9,12,103)
d.b
's solution does not work for main_list
so I modified it as follows:
ML = paste(main_list, collapse = ",") # collapse should not be empty
SL = paste(sub_list, collapse = ",")
out<-gsub(SL, "", ML)
out<-gsub("^\\,","",out)
out<-gsub("\\,$","",out)
out<-gsub("\\,,","\\,",out)
out<-as.numeric(unlist(strsplit(out,split = ",")))
The result:
solution seconds memory_byte memory_base seconds_base
<chr> <dbl> <dbl> <dbl> <dbl>
1 d.b 26.0 399904560 1 16.8
2 Grothendieck_2 1.55 1440070304 3.60 1
3 Grothendieck_1 109. 4968036376 12.4 70.3
4 李哲源 2.17 1400120824 3.50 1.40
Any comment about the benchmarking?
Here are two solutions. The first one is obviously simpler and would be used if you favour clarity and maintainability while the second one has no package dependencies and is faster.
1) zoo Use a moving window to compare each subsequence of c(main_list, sub_list)
having the required length to the sub_list
. (We append sub_list
to ensure that there is always something to remove.) This statements returns TRUE or FALSE according to whether the current position is the end of a matching subsequence. Then compute the TRUE index numbers and from that the indices of all elements to be removed and remove them.
library(zoo)
w <- length(sub_list)
r <- rollapplyr(c(main_list, sub_list), w, identical, sub_list, fill = FALSE)
main_list[-c(outer(which(r), seq_len(w) - 1, "-"))]
## [1] 2 3 4 2 2 1
2) Base R. The middle line setting r
has the same purpose as the corresponding line in (1) and the last line is the same as the last line in (2) except we use +
instead of -
due to the fact that embed
effectively uses left alignment.
w <- length(sub_list)
r <- colSums(t(embed(c(main_list, sub_list), w)) == rev(sub_list)) == w
main_list[-c(outer(which(r), seq_len(w) - 1, "+"))]
## [1] 2 3 4 2 2 1
Here is a function that does this general thing.
xm
is a main list of integer / character / logical values;xs
is a sub list of integer /character / logical values.It is required that length(xm) > length(xs)
but no such check is made right now.
foo <- function (xm, xs) {
nm <- length(xm)
ns <- length(xs)
shift_ind <- outer(0:(ns - 1), 1:(nm - ns + 1), "+")
d <- xm[shift_ind] == xs
first_drop_ind <- which(.colSums(d, ns, length(d) / ns) == ns)
if (length(first_drop_ind) > 0L) {
drop_ind <- outer(0:(ns - 1), first_drop_ind, "+")
return(xm[-drop_ind])
} else {
return(xm)
}
}
main_list = c(2, 1, 2, 3, 1, 2, 4, 2, 2 ,1)
sub_list = c(1,2)
foo(main_list, sub_list)
#[1] 2 3 4 2 2 1
Explanation
xm <- main_list
xs <- sub_list
nm <- length(xm)
ns <- length(xs)
shift_ind <- outer(0:(ns - 1), 1:(nm - ns + 1), "+")
MAT <- matrix(xm[shift_ind], ns)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#[1,] 2 1 2 3 1 2 4 2 2
#[2,] 1 2 3 1 2 4 2 2 1
So the first step is a shifting and matrix representation, as above.
LOGIC <- MAT == xs
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#[1,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
#[2,] FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
If a co-occurrence is found, a column should contain all TRUE
, i.e., the colSums
should be ns
. In this way we can identify the location of the first value of the matching.
first_drop_ind <- which(colSums(LOGIC) == ns)
#[1] 2 5
Now we need to expand it to cover the subsequent values after those initial matches.
drop_ind <- outer(0:(ns - 1), first_drop_ind, "+")
# [,1] [,2]
#[1,] 2 5
#[2,] 3 6
Finally we remove values at those positions from xm
:
xm[-drop_ind]
#[1] 2 3 4 2 2 1
Note that in the function, the matrix is not explicitly formed. .colSums
is used instead of colSums
.
watch out for bug
The if ... else ...
in the function is necessary. If no match is found then drop_ind
would be integer(0)
, and using xm[-drop_ind]
gives xm[integer(0)]
which is integer(0)
.
comparison with zoo::rollapplyr
## require package `zoo`
bar <- function (xm, xs) {
w <- length(xs)
r <- rollapplyr(xm, w, identical, xs, fill = FALSE)
if (length(r) > 0L) {
return(xm[-c(outer(which(r), seq_len(w) - 1, "-"))])
} else {
return(xm)
}
}
set.seed(0)
xm <- sample.int(10, 10000, TRUE)
xs <- 1:2
library(zoo)
system.time(a <- foo(xm, xs))
# user system elapsed
# 0.004 0.000 0.001
system.time(b <- bar(xm, xs))
# user system elapsed
# 0.276 0.000 0.273
all.equal(a, b)
#[1] TRUE
I guess that rollapplyr
is slower is because
xm
to a "zoo" object;lapply
so that there is a frequent jump between R and C.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With