I have a data frame (df):
V1 V2
1 "BCC" Yes
2 "ABB" Yes
I want to find all the strings that contain a certain sequence of characters, regardless of their order. For example, if I search for the string "CBC" or "CCB", I would like to get:
V1 V2
1 "BCC" Yes
I've tried grep, but it only finds strings that match the pattern in exactly the given order:
>df[grep("CBC", df$V1),]
[1] V1 V2
<0 rows> (or 0-length row.names)
>df[grep("BCC", df$V1),]
V1 V2
1 "BCC" Yes
We can create a logical index by splitting each string in the column into single characters and checking that both "B" and "C" are present:
i1 <- sapply(strsplit(df$V1, ""), function(x) all(c("B", "C") %in% x))
df[i1, , drop = FALSE]
# V1 V2
#1 BCC Yes
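The same check can be wrapped in a small helper so that the required characters come from a query string instead of being hard-coded. This is only a sketch: has_all_chars is a hypothetical name, not something from the answer, and it tests whether each distinct character is present, not how many times it occurs.

```r
# Hypothetical helper: TRUE where a string contains every distinct
# character of `query`, in any order (character counts are ignored).
has_all_chars <- function(strings, query) {
  needed <- unique(strsplit(query, "")[[1]])
  vapply(strsplit(as.character(strings), ""),
         function(chars) all(needed %in% chars),
         logical(1))
}

df <- data.frame(V1 = c("BCC", "ABB"), V2 = c("Yes", "Yes"),
                 stringsAsFactors = FALSE)

idx <- has_all_chars(df$V1, "CBC")   # TRUE FALSE
res <- df[idx, , drop = FALSE]       # keeps only the "BCC" row
```

The as.character() call is there in case V1 is stored as a factor, which strsplit would otherwise reject.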
If we have two datasets and one is a lookup table ('df2'), then split the column into characters, paste the sorted elements, and use %in% to create the logical vector for filtering the rows:
v1n <- sapply(strsplit(df1$v1, ""), function(x) paste(sort(x), collapse=""))
v1l <- sapply(strsplit(df2$v1, ""), function(x) paste(sort(x), collapse=""))
df1[v1n %in% v1l, , drop = FALSE]
## data used above
df1 <- data.frame(v1 = c("BCC", "CAB", "ABB", "CBC", "CCB", "BAB", "CDB"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(v1 = c("CBC", "ABB"), stringsAsFactors = FALSE)
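To see what the sorted-key comparison returns on this data, here is a minimal sketch. sorted_key is a hypothetical helper name introduced for illustration. Unlike a plain presence check, this approach only matches strings that use exactly the same letters the same number of times (anagrams).

```r
# Hypothetical helper: sort the characters of each string so that
# anagrams share one key, e.g. "CBC" and "CCB" both become "BCC".
sorted_key <- function(strings) {
  vapply(strsplit(as.character(strings), ""),
         function(chars) paste(sort(chars), collapse = ""),
         character(1))
}

df1 <- data.frame(v1 = c("BCC", "CAB", "ABB", "CBC", "CCB", "BAB", "CDB"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(v1 = c("CBC", "ABB"), stringsAsFactors = FALSE)

matched <- df1[sorted_key(df1$v1) %in% sorted_key(df2$v1), , drop = FALSE]
matched$v1
# "BCC" "ABB" "CBC" "CCB" "BAB"
```

"CAB" and "CDB" are dropped because their sorted keys ("ABC", "BCD") do not appear among the lookup keys ("BCC", "ABB").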
In the comments you mention a lookup table. If that is the case, one approach is to join both sets together, then use the regex by Wiktor Stribiżew to indicate which rows are valid.
As I'm joining data sets, I'm going to use data.table.
library(data.table)
## dummy data, and a lookup table
dt <- data.frame(V1 = c("BCC", "ABB"))
dt_lookup <- data.frame(V1 = c("CBC","BAB", "CCB"))
## convert to data.table
setDT(dt); setDT(dt_lookup)
## add some indexes to keep track of rows from each dt
dt[, idx := .I]
dt_lookup[, l_idx := .I]
## create a column to join on
dt[, key := 1L]
dt_lookup[, key := 1L]
## join EVERYTHING
dt <- dt[
dt_lookup
, on = "key"
, allow.cartesian = TRUE
]
## regex: V1 is valid if it consists only of characters found in the lookup string (i.V1)
dt[
, valid := grepl(paste0("^[",i.V1,"]+$"), V1)
, by = 1:nrow(dt)
]
# V1 idx key i.V1 l_idx valid
# 1: BCC 1 1 CBC 1 TRUE
# 2: ABB 2 1 CBC 1 FALSE
# 3: BCC 1 1 BAB 2 FALSE
# 4: ABB 2 1 BAB 2 TRUE
# 5: BCC 1 1 CCB 3 TRUE
# 6: ABB 2 1 CCB 3 FALSE
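It may help to look at the regex on its own. The pattern puts the lookup string inside a character class, so it only checks that the candidate is made up of those characters; it does not check counts or length. A small sketch, with test strings chosen here purely for illustration:

```r
lookup  <- "CBC"
pattern <- paste0("^[", lookup, "]+$")   # becomes "^[CBC]+$"

# "BBB" and "CB" also pass, because every character is a "B" or a "C".
res <- grepl(pattern, c("BCC", "ABB", "BBB", "CB"))
res
# TRUE FALSE TRUE TRUE
```

If an exact anagram match is required, the regex test alone would not be enough. Lookup strings containing characters that are special inside brackets (such as ^, - or ]) would also need escaping; that case is not handled here.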
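A slightly more memory-efficient approach might be to use this technique by Jaap, as it avoids the 'join everything' step and instead joins 'by each i', one row at a time. Note that the output below corresponds to the original two-row dt (with idx and key added), not the joined result that overwrote it above.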
dt_lookup[
dt,
{
valid = grepl(paste0("^[",i.V1,"]+$"), V1)
.(
V1 = V1[valid]
, idx = i.idx
, match = i.V1
, l_idx = l_idx[valid]
)
}
, on = "key"
, by = .EACHI
]
# key V1 idx match l_idx
# 1: 1 CBC 1 BCC 1
# 2: 1 CCB 1 BCC 3
# 3: 1 BAB 2 ABB 2
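For comparison, the same pairwise check can be sketched in base R without data.table. This is an alternative written for illustration, not part of the answer above: it uses expand.grid to enumerate every (row, lookup row) pair, and the names pairs and lookup are assumptions.

```r
dt        <- data.frame(V1 = c("BCC", "ABB"), stringsAsFactors = FALSE)
dt_lookup <- data.frame(V1 = c("CBC", "BAB", "CCB"), stringsAsFactors = FALSE)

# Every combination of a row in dt and a row in dt_lookup
pairs <- expand.grid(idx = seq_len(nrow(dt)),
                     l_idx = seq_len(nrow(dt_lookup)))
pairs$V1     <- dt$V1[pairs$idx]
pairs$lookup <- dt_lookup$V1[pairs$l_idx]

# Same regex as above, applied pair by pair
pairs$valid <- mapply(function(s, l) grepl(paste0("^[", l, "]+$"), s),
                      pairs$V1, pairs$lookup, USE.NAMES = FALSE)

valid_pairs <- pairs[pairs$valid, ]   # rows where the string fits the lookup
```

Like the full join, this materialises all combinations, so it shares the memory cost that the by = .EACHI version tries to avoid.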