How to remove character strings that are detected/contained within other character strings, but only within a specified group_by()-argument

Question

Let's say I have:

> w
   digest    gene          seq
1     InS  AB0583          AAB
2     InS  AB0583        AABKR
3     InS  AB0583      GFHGHGG
4     PAC PU83022          EUT
5     PAC PU83022      HSFSFJF
6     PAC PU83022        EUTCK
7     PAC PU83022       EUTCKJ
8     InS PO93853         HDGJ
9     InS PO93853        HDGJU
10    InS PO93853       YTYEYD
11    InS PO93853 YTYEYDJHSGSG
12    InS PO93853   SALGHAGGEE

I have applied two different methods to identify proteins (decoded with their gene name, w$gene). The methods are encoded in w$digest. As you can see, there may be overlapping sequences of w$seq within each w$gene within each w$digest - e.g. EUT is also within EUTCK, which is within EUTCKJ.

I want to know how many unique amino acids, each letter in w$seq, were identified. Therefore, I need to remove any/all character string(s) that where detected within another character string, but only when grouped_by(digest, gene). The character string with most characters should be kept.

I seek a solution in tidyverse

Help need:

(1) Count the number of characters, and arrange as follows:

w <- w %>%
  mutate(count = str_count(seq)) %>%
  arrange(digest, gene, count)

So that

> w
   digest    gene          seq count
1     InS  AB0583          AAB     3
2     InS  AB0583        AABKR     5
3     InS  AB0583      GFHGHGG     7
4     InS PO93853         HDGJ     4
5     InS PO93853        HDGJU     5
6     InS PO93853       YTYEYD     6

(2) group_by(digest, gene), and now remove rows that contain a w$seq that is detected within another w$seq (within this grouping), and keep the row where the w$seq has most characters.

Output

> w
   digest    gene          seq count
1     InS  AB0583          AAB     3 #* found within:
2     InS  AB0583        AABKR     5 #*
3     InS  AB0583      GFHGHGG     7
4     InS PO93853         HDGJ     4 #** found within:
5     InS PO93853        HDGJU     5 #**
6     InS PO93853       YTYEYD     6 #***
7     InS PO93853   SALGHAGGEE    10
8     InS PO93853 YTYEYDJHSGSG    12 #***
9     PAC PU83022          EUT     3 #****
10    PAC PU83022        EUTCK     5 #****
11    PAC PU83022       EUTCKJ     6 #****
12    PAC PU83022      HSFSFJF     7

Therefore, Expected output

> w
   digest    gene          seq count
1     InS  AB0583        AABKR     5 
2     InS  AB0583      GFHGHGG     7
3     InS PO93853        HDGJU     5 
4     InS PO93853   SALGHAGGEE    10
5     InS PO93853 YTYEYDJHSGSG    12 
6     PAC PU83022       EUTCKJ     6 
7     PAC PU83022      HSFSFJF     7

Data

w <- data.frame(
  digest = c(rep("InS", 3), rep("PAC", 4), rep("InS", 5)),
  gene = c(rep("AB0583", 3), rep("PU83022", 4), rep("PO93853", 5)),
  seq = c("AAB", "AABKR", "GFHGHGG",
          "EUT", "HSFSFJF", "EUTCK", "EUTCKJ",
          "HDGJ", "HDGJU", "YTYEYD", "YTYEYDJHSGSG", "SALGHAGGEE")
)

DaveArmstrong · Accepted Answer

For each group in group_by() you could make a new list column where each row contains all the seq values for that group. You could then do a row-wise operation where you count the number of times each value of seq shows up in all the values. Keeping the ones that only show up once will give you the result you want.

library(dplyr)
library(stringr)
w <- data.frame(
  digest = c(rep("InS", 3), rep("PAC", 4), rep("InS", 5)),
  gene = c(rep("AB0583", 3), rep("PU83022", 4), rep("PO93853", 5)),
  seq = c("AAB", "AABKR", "GFHGHGG",
          "EUT", "HSFSFJF", "EUTCK", "EUTCKJ",
          "HDGJ", "HDGJU", "YTYEYD", "YTYEYDJHSGSG", "SALGHAGGEE")
)

w <- w %>%
  mutate(count = str_count(seq)) %>%
  arrange(digest, gene, count) 

w %>% group_by(digest, gene) %>%
  mutate(all_vals = list(seq)) %>% 
  rowwise() %>% 
  mutate(win = sum(grepl(seq, all_vals))) %>% 
  filter(win == 1) %>% 
  dplyr::select(-c(win, all_vals))
#> # A tibble: 7 × 4
#> # Rowwise:  digest, gene
#>   digest gene    seq          count
#>   <chr>  <chr>   <chr>        <int>
#> 1 InS    AB0583  AABKR            5
#> 2 InS    AB0583  GFHGHGG          7
#> 3 InS    PO93853 HDGJU            5
#> 4 InS    PO93853 SALGHAGGEE      10
#> 5 InS    PO93853 YTYEYDJHSGSG    12
#> 6 PAC    PU83022 EUTCKJ           6
#> 7 PAC    PU83022 HSFSFJF          7

^{Created on 2023-05-25 with reprex v2.0.2}

jared_mamrot · Answer

Edit:

Here is a potential solution:

library(tidyverse)

w <- data.frame(
  digest = c(rep("InS", 3), rep("PAC", 4), rep("InS", 5)),
  gene = c(rep("AB0583", 3), rep("PU83022", 4), rep("PO93853", 5)),
  seq = c("AAB", "AABKR", "GFHGHGG",
          "EUT", "HSFSFJF", "EUTCK", "EUTCKJ",
          "HDGJ", "HDGJU", "YTYEYD", "YTYEYDJHSGSG", "SALGHAGGEE")
)

w %>%
  mutate(count = str_count(seq)) %>%
  arrange(digest, gene, count) %>%
  group_by(digest, gene) %>%
  filter(str_count(paste0(seq, collapse = "_"), seq) == 1)
#> # A tibble: 7 × 4
#> # Groups:   digest, gene [3]
#>   digest gene    seq          count
#>   <chr>  <chr>   <chr>        <int>
#> 1 InS    AB0583  AABKR            5
#> 2 InS    AB0583  GFHGHGG          7
#> 3 InS    PO93853 HDGJU            5
#> 4 InS    PO93853 SALGHAGGEE      10
#> 5 InS    PO93853 YTYEYDJHSGSG    12
#> 6 PAC    PU83022 EUTCKJ           6
#> 7 PAC    PU83022 HSFSFJF          7

^{Created on 2023-05-25 with reprex v2.0.2}

Original answer:

This is a bit awkward, but it 'works':

library(tidyverse)

w <- data.frame(
  digest = c(rep("InS", 3), rep("PAC", 4), rep("InS", 5)),
  gene = c(rep("AB0583", 3), rep("PU83022", 4), rep("PO93853", 5)),
  seq = c("AAB", "AABKR", "GFHGHGG",
          "EUT", "HSFSFJF", "EUTCK", "EUTCKJ",
          "HDGJ", "HDGJU", "YTYEYD", "YTYEYDJHSGSG", "SALGHAGGEE")
)

w %>%
  mutate(count = str_count(seq)) %>%
  arrange(digest, gene, count) %>%
  group_by(digest, gene) %>%
  mutate(strings = paste0(seq, collapse = "|")) %>%
  rowwise() %>%
  mutate(strings = gsub(paste0("\b", seq, "\b"), "", strings)) %>%
  filter(!grepl(seq, strings)) %>%
  select(-strings) %>%
  ungroup()
#> # A tibble: 7 × 4
#>   digest gene    seq          count
#>   <chr>  <chr>   <chr>        <int>
#> 1 InS    AB0583  AABKR            5
#> 2 InS    AB0583  GFHGHGG          7
#> 3 InS    PO93853 HDGJU            5
#> 4 InS    PO93853 SALGHAGGEE      10
#> 5 InS    PO93853 YTYEYDJHSGSG    12
#> 6 PAC    PU83022 EUTCKJ           6
#> 7 PAC    PU83022 HSFSFJF          7

^{Created on 2023-05-25 with reprex v2.0.2}

How to remove character strings that are detected/contained within other character strings, but only within a specified group_by()-argument

Tags:

string

dataframe

r

group-by

tidyverse

cmirian

2 Answers

DaveArmstrong

Edit:

Original answer:

jared_mamrot

Recent Activity

Donate For Us

How to remove character strings that are detected/contained within other character strings, but only within a specified group_by()-argument

Tags:

string

dataframe

r

group-by

tidyverse

cmirian

2 Answers

DaveArmstrong

Edit:

Original answer:

jared_mamrot

Related questions

Recent Activity

Donate For Us