Find strings that contain a sequence of characters regardless of the order in r

Question

I have a dataframe(df)

    V1    V2
1 "BCC"  Yes
2 "ABB"  Yes

I want to find all the strings that contain a certain sequence of characters, regardless of the order. For example if I have the string "CBC" or "CCB" I would like to get

    V1    V2
1 "BCC"  Yes

I've tried with grep, but It only finds the matching patterns

>df[grep("CBC", df$V1),]
1  V1   V2
<0 rows> (or 0-length row.names)

>df[grep("BCC", df$V1),]
   V1   V2
1 "BCC" Yes

akrun · Accepted Answer

We can create a logical index by splitting the column

i1 <- sapply(strsplit(df$V1, ""), function(x) all(c("B", "C") %in% x))
df[i1, , drop = FALSE]
#   V1  V2
#1 BCC Yes

if we have two datasets and one is a lookup table ('df2'), then split the column into characters,paste the sorted elements, and use %in% to create the logical vector for filtering the rows

v1n <- sapply(strsplit(df1$v1, ""), function(x) paste(sort(x), collapse=""))
v1l <- sapply(strsplit(df2$v1, ""), function(x) paste(sort(x), collapse=""))
df1[v1n %in% v1l, , drop = FALSE]

data

df1 <- data.frame(v1 = c("BCC", "CAB" , "ABB", "CBC", "CCB", "BAB", "CDB"),
     stringsAsFactors = FALSE)
df2 <- data.frame(v1 = c("CBC", "ABB"), stringsAsFactors = FALSE)

SymbolixAU · Answer

In the comments you mention a lookup table. If this is the case, an approach could be to join both sets together, then use the regex by Wiktor Stribiżew to indicate which are valid

As I'm joining data sets I'm going to use data.table

Method 1: Join everything

library(data.table)

## dummy data, and a lookup table
dt <- data.frame(V1 = c("BCC", "ABB"))
dt_lookup <- data.frame(V1 = c("CBC","BAB", "CCB"))

## convert to data.table
setDT(dt); setDT(dt_lookup)

## add some indexes to keep track of rows from each dt
dt[, idx := .I]
dt_lookup[, l_idx := .I]

## create a column to join on
dt[, key := 1L]
dt_lookup[, key := 1L]

## join EVERYTHING
dt <- dt[
    dt_lookup
    , on = "key"
    , allow.cartesian = T
]

#regex
dt[
    , valid := grepl(paste0("^[",i.V1,"]+$"), V1)
    , by = 1:nrow(dt)
]

#     V1 idx key i.V1 l_idx valid
# 1: BCC   1   1  CBC     1  TRUE
# 2: ABB   2   1  CBC     1 FALSE
# 3: BCC   1   1  BAB     2 FALSE
# 4: ABB   2   1  BAB     2  TRUE
# 5: BCC   1   1  CCB     3  TRUE
# 6: ABB   2   1  CCB     3 FALSE

Method 2: EACHI join

A slightly more memory-efficient approach might be to use this technique by Jaap as it avoids the 'join everything' step, and in stead joins it 'by each i' (row) at a time.

dt_lookup[
    dt, 
    {
        valid = grepl(paste0("^[",i.V1,"]+$"), V1)
        .(
            V1 = V1[valid]
            , idx = i.idx
            , match = i.V1
            , l_idx = l_idx[valid]
            )
    }
    , on = "key"
    , by = .EACHI
]

#    key  V1 idx match l_idx
# 1:   1 CBC   1   BCC     1
# 2:   1 CCB   1   BCC     3
# 3:   1 BAB   2   ABB     2

Find strings that contain a sequence of characters regardless of the order in r

Tags:

string

pattern-matching

dataframe

r

permutation

aipam

2 Answers

data

akrun

Method 1: Join everything

Method 2: EACHI join

SymbolixAU

Recent Activity

Donate For Us

Find strings that contain a sequence of characters regardless of the order in r

Tags:

string

pattern-matching

dataframe

r

permutation

aipam

2 Answers

data

akrun

Method 1: Join everything

Method 2: EACHI join

SymbolixAU

Related questions

Recent Activity

Donate For Us