Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find strings that contain a sequence of characters regardless of the order in r

I have a dataframe(df)

    V1    V2
1 "BCC"  Yes
2 "ABB"  Yes

I want to find all the strings that contain a certain sequence of characters, regardless of the order. For example if I have the string "CBC" or "CCB" I would like to get

    V1    V2
1 "BCC"  Yes

I've tried with grep, but It only finds the matching patterns

>df[grep("CBC", df$V1),]
1  V1   V2
<0 rows> (or 0-length row.names)

>df[grep("BCC", df$V1),]
   V1   V2
1 "BCC" Yes
like image 465
aipam Avatar asked Dec 17 '22 22:12

aipam


2 Answers

We can create a logical index by splitting the column

i1 <- sapply(strsplit(df$V1, ""), function(x) all(c("B", "C") %in% x))
df[i1, , drop = FALSE]
#   V1  V2
#1 BCC Yes

if we have two datasets and one is a lookup table ('df2'), then split the column into characters,paste the sorted elements, and use %in% to create the logical vector for filtering the rows

v1n <- sapply(strsplit(df1$v1, ""), function(x) paste(sort(x), collapse=""))
v1l <- sapply(strsplit(df2$v1, ""), function(x) paste(sort(x), collapse=""))
df1[v1n %in% v1l, , drop = FALSE]

data

df1 <- data.frame(v1 = c("BCC", "CAB" , "ABB", "CBC", "CCB", "BAB", "CDB"),
     stringsAsFactors = FALSE)
df2 <- data.frame(v1 = c("CBC", "ABB"), stringsAsFactors = FALSE)
like image 158
akrun Avatar answered Dec 25 '22 22:12

akrun


In the comments you mention a lookup table. If this is the case, an approach could be to join both sets together, then use the regex by Wiktor Stribiżew to indicate which are valid

As I'm joining data sets I'm going to use data.table

Method 1: Join everything

library(data.table)

## dummy data, and a lookup table
dt <- data.frame(V1 = c("BCC", "ABB"))
dt_lookup <- data.frame(V1 = c("CBC","BAB", "CCB"))

## convert to data.table
setDT(dt); setDT(dt_lookup)

## add some indexes to keep track of rows from each dt
dt[, idx := .I]
dt_lookup[, l_idx := .I]

## create a column to join on
dt[, key := 1L]
dt_lookup[, key := 1L]

## join EVERYTHING
dt <- dt[
    dt_lookup
    , on = "key"
    , allow.cartesian = T
]

#regex
dt[
    , valid := grepl(paste0("^[",i.V1,"]+$"), V1)
    , by = 1:nrow(dt)
]

#     V1 idx key i.V1 l_idx valid
# 1: BCC   1   1  CBC     1  TRUE
# 2: ABB   2   1  CBC     1 FALSE
# 3: BCC   1   1  BAB     2 FALSE
# 4: ABB   2   1  BAB     2  TRUE
# 5: BCC   1   1  CCB     3  TRUE
# 6: ABB   2   1  CCB     3 FALSE

Method 2: EACHI join

A slightly more memory-efficient approach might be to use this technique by Jaap as it avoids the 'join everything' step, and in stead joins it 'by each i' (row) at a time.

dt_lookup[
    dt, 
    {
        valid = grepl(paste0("^[",i.V1,"]+$"), V1)
        .(
            V1 = V1[valid]
            , idx = i.idx
            , match = i.V1
            , l_idx = l_idx[valid]
            )
    }
    , on = "key"
    , by = .EACHI
]

#    key  V1 idx match l_idx
# 1:   1 CBC   1   BCC     1
# 2:   1 CCB   1   BCC     3
# 3:   1 BAB   2   ABB     2
like image 26
SymbolixAU Avatar answered Dec 25 '22 22:12

SymbolixAU