Let's say I have:
> w
digest gene seq
1 InS AB0583 AAB
2 InS AB0583 AABKR
3 InS AB0583 GFHGHGG
4 PAC PU83022 EUT
5 PAC PU83022 HSFSFJF
6 PAC PU83022 EUTCK
7 PAC PU83022 EUTCKJ
8 InS PO93853 HDGJ
9 InS PO93853 HDGJU
10 InS PO93853 YTYEYD
11 InS PO93853 YTYEYDJHSGSG
12 InS PO93853 SALGHAGGEE
I have applied two different methods to identify proteins (decoded with their gene name, w$gene). The methods are encoded in w$digest. As you can see, there may be overlapping sequences of w$seq within each w$gene within each w$digest - e.g. EUT is also within EUTCK, which is within EUTCKJ.
I want to know how many unique amino acids, each letter in w$seq, were identified. Therefore, I need to remove any/all character string(s) that where detected within another character string, but only when grouped_by(digest, gene). The character string with most characters should be kept.
I seek a solution in tidyverse
Help need:
(1) Count the number of characters, and arrange as follows:
w <- w %>%
mutate(count = str_count(seq)) %>%
arrange(digest, gene, count)
So that
> w
digest gene seq count
1 InS AB0583 AAB 3
2 InS AB0583 AABKR 5
3 InS AB0583 GFHGHGG 7
4 InS PO93853 HDGJ 4
5 InS PO93853 HDGJU 5
6 InS PO93853 YTYEYD 6
(2) group_by(digest, gene), and now remove rows that contain a w$seq that is detected within another w$seq (within this grouping), and keep the row where the w$seq has most characters.
Output
> w
digest gene seq count
1 InS AB0583 AAB 3 #* found within:
2 InS AB0583 AABKR 5 #*
3 InS AB0583 GFHGHGG 7
4 InS PO93853 HDGJ 4 #** found within:
5 InS PO93853 HDGJU 5 #**
6 InS PO93853 YTYEYD 6 #***
7 InS PO93853 SALGHAGGEE 10
8 InS PO93853 YTYEYDJHSGSG 12 #***
9 PAC PU83022 EUT 3 #****
10 PAC PU83022 EUTCK 5 #****
11 PAC PU83022 EUTCKJ 6 #****
12 PAC PU83022 HSFSFJF 7
Therefore, Expected output
> w
digest gene seq count
1 InS AB0583 AABKR 5
2 InS AB0583 GFHGHGG 7
3 InS PO93853 HDGJU 5
4 InS PO93853 SALGHAGGEE 10
5 InS PO93853 YTYEYDJHSGSG 12
6 PAC PU83022 EUTCKJ 6
7 PAC PU83022 HSFSFJF 7
Data
w <- data.frame(
digest = c(rep("InS", 3), rep("PAC", 4), rep("InS", 5)),
gene = c(rep("AB0583", 3), rep("PU83022", 4), rep("PO93853", 5)),
seq = c("AAB", "AABKR", "GFHGHGG",
"EUT", "HSFSFJF", "EUTCK", "EUTCKJ",
"HDGJ", "HDGJU", "YTYEYD", "YTYEYDJHSGSG", "SALGHAGGEE")
)
For each group in group_by() you could make a new list column where each row contains all the seq values for that group. You could then do a row-wise operation where you count the number of times each value of seq shows up in all the values. Keeping the ones that only show up once will give you the result you want.
library(dplyr)
library(stringr)
w <- data.frame(
digest = c(rep("InS", 3), rep("PAC", 4), rep("InS", 5)),
gene = c(rep("AB0583", 3), rep("PU83022", 4), rep("PO93853", 5)),
seq = c("AAB", "AABKR", "GFHGHGG",
"EUT", "HSFSFJF", "EUTCK", "EUTCKJ",
"HDGJ", "HDGJU", "YTYEYD", "YTYEYDJHSGSG", "SALGHAGGEE")
)
w <- w %>%
mutate(count = str_count(seq)) %>%
arrange(digest, gene, count)
w %>% group_by(digest, gene) %>%
mutate(all_vals = list(seq)) %>%
rowwise() %>%
mutate(win = sum(grepl(seq, all_vals))) %>%
filter(win == 1) %>%
dplyr::select(-c(win, all_vals))
#> # A tibble: 7 × 4
#> # Rowwise: digest, gene
#> digest gene seq count
#> <chr> <chr> <chr> <int>
#> 1 InS AB0583 AABKR 5
#> 2 InS AB0583 GFHGHGG 7
#> 3 InS PO93853 HDGJU 5
#> 4 InS PO93853 SALGHAGGEE 10
#> 5 InS PO93853 YTYEYDJHSGSG 12
#> 6 PAC PU83022 EUTCKJ 6
#> 7 PAC PU83022 HSFSFJF 7
Created on 2023-05-25 with reprex v2.0.2
Here is a potential solution:
library(tidyverse)
w <- data.frame(
digest = c(rep("InS", 3), rep("PAC", 4), rep("InS", 5)),
gene = c(rep("AB0583", 3), rep("PU83022", 4), rep("PO93853", 5)),
seq = c("AAB", "AABKR", "GFHGHGG",
"EUT", "HSFSFJF", "EUTCK", "EUTCKJ",
"HDGJ", "HDGJU", "YTYEYD", "YTYEYDJHSGSG", "SALGHAGGEE")
)
w %>%
mutate(count = str_count(seq)) %>%
arrange(digest, gene, count) %>%
group_by(digest, gene) %>%
filter(str_count(paste0(seq, collapse = "_"), seq) == 1)
#> # A tibble: 7 × 4
#> # Groups: digest, gene [3]
#> digest gene seq count
#> <chr> <chr> <chr> <int>
#> 1 InS AB0583 AABKR 5
#> 2 InS AB0583 GFHGHGG 7
#> 3 InS PO93853 HDGJU 5
#> 4 InS PO93853 SALGHAGGEE 10
#> 5 InS PO93853 YTYEYDJHSGSG 12
#> 6 PAC PU83022 EUTCKJ 6
#> 7 PAC PU83022 HSFSFJF 7
Created on 2023-05-25 with reprex v2.0.2
This is a bit awkward, but it 'works':
library(tidyverse)
w <- data.frame(
digest = c(rep("InS", 3), rep("PAC", 4), rep("InS", 5)),
gene = c(rep("AB0583", 3), rep("PU83022", 4), rep("PO93853", 5)),
seq = c("AAB", "AABKR", "GFHGHGG",
"EUT", "HSFSFJF", "EUTCK", "EUTCKJ",
"HDGJ", "HDGJU", "YTYEYD", "YTYEYDJHSGSG", "SALGHAGGEE")
)
w %>%
mutate(count = str_count(seq)) %>%
arrange(digest, gene, count) %>%
group_by(digest, gene) %>%
mutate(strings = paste0(seq, collapse = "|")) %>%
rowwise() %>%
mutate(strings = gsub(paste0("\\b", seq, "\\b"), "", strings)) %>%
filter(!grepl(seq, strings)) %>%
select(-strings) %>%
ungroup()
#> # A tibble: 7 × 4
#> digest gene seq count
#> <chr> <chr> <chr> <int>
#> 1 InS AB0583 AABKR 5
#> 2 InS AB0583 GFHGHGG 7
#> 3 InS PO93853 HDGJU 5
#> 4 InS PO93853 SALGHAGGEE 10
#> 5 InS PO93853 YTYEYDJHSGSG 12
#> 6 PAC PU83022 EUTCKJ 6
#> 7 PAC PU83022 HSFSFJF 7
Created on 2023-05-25 with reprex v2.0.2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With