
How to split a string and count letter frequency using a dplyr pipe

I have the following data frame:

library(tidyverse)
dat <- structure(list(fasta_header = c(">seq1", ">seq2"), sequence = c("MPSRGTRPE", 
"VSSKYTFWNF")), .Names = c("fasta_header", "sequence"), row.names = c(NA, 
-2L), class = c("tbl_df", "tbl", "data.frame"))


dat
#> # A tibble: 2 x 2
#>   fasta_header sequence  
#>   <chr>        <chr>     
#> 1 >seq1        MPSRGTRPE 
#> 2 >seq2        VSSKYTFWNF

I want to count how often each amino acid occurs in every row. The desired result (worked out by hand) is:

   fasta_header sequence    M  P  S  R  G  T  E  V  K  Y  F  W  N
   >seq1        MPSRGTRPE   1  2  1  2  1  1  1  0  0  0  0  0  0
   >seq2        VSSKYTFWNF  0  0  2  0  0  1  0  1  1  1  2  1  1

How can I do that with a dplyr pipeline?

scamander asked Apr 04 '18 08:04


2 Answers

The comments above are right, but if you really want a tidyverse pipeline...

library(tidyverse)                     # uses dplyr, purrr, tidyr and stringr
dat %>%
  mutate(split = map(sequence, ~ unlist(str_split(., "")))) %>% # split into characters
  unnest(split) %>%                    # one row per character (name the column explicitly)
  group_by(fasta_header, sequence) %>% # group by sequence
  count(split) %>%                     # count letters within each group
  spread(key = split, value = n, fill = 0)   # convert to wide format

  fasta_header sequence       E     F     G     K     M     N     P     R     S     T     V     W     Y
  <chr>        <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 >seq1        MPSRGTRPE     1.    0.    1.    0.    1.    0.    2.    2.    1.    1.    0.    0.    0.
2 >seq2        VSSKYTFWNF    0.    2.    0.    1.    0.    1.    0.    0.    2.    1.    1.    1.    1.
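
In newer tidyr releases `spread()` is superseded by `pivot_wider()`. A minimal sketch of the same idea, assuming tidyr 1.0.0 or later is installed (the names `aa` and `aa_counts` are only illustrative, and the data frame is rebuilt here so the snippet runs on its own):

```r
library(dplyr)
library(tidyr)
library(stringr)

# Same example data as in the question
dat <- tibble(
  fasta_header = c(">seq1", ">seq2"),
  sequence     = c("MPSRGTRPE", "VSSKYTFWNF")
)

aa_counts <- dat %>%
  mutate(aa = str_split(sequence, "")) %>%   # list-column: one character vector per row
  unnest(aa) %>%                             # one row per residue
  count(fasta_header, sequence, aa) %>%      # residue counts per sequence
  pivot_wider(names_from = aa, values_from = n,
              values_fill = list(n = 0L))    # letters absent from a sequence become 0

aa_counts
```

The counts stay integer here because the fill value is `0L`. The column order follows first appearance after counting, so it may differ from the alphabetical order that `spread()` gives.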
Andrew Gustar answered Sep 22 '22 12:09


Here you go

library(tidyverse)  # attaches dplyr and stringr, among others
dat <- structure(list(fasta_header = c(">seq1", ">seq2"), sequence = c("MPSRGTRPE", 
"VSSKYTFWNF")), .Names = c("fasta_header", "sequence"), row.names = c(NA, 
-2L), class = c("tbl_df", "tbl", "data.frame"))
# Vector of unique amino acids
uniqueaa <- as.character(dat$sequence) %>% strsplit(split = "") %>%
  unlist() %>% unique() %>% data.frame(stringsAsFactors = FALSE)
colnames(uniqueaa) <- "uniqueaa"
# Count occurrences of each amino acid in each sequence
result <- apply(uniqueaa, 1, function(x) str_count(dat$sequence, x["uniqueaa"]))
colnames(result) <- uniqueaa$uniqueaa
rownames(result) <- dat$sequence
result
           M P S R G T E V K Y F W N
MPSRGTRPE  1 2 1 2 1 1 1 0 0 0 0 0 0
VSSKYTFWNF 0 0 2 0 0 1 0 1 1 1 2 1 1
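
The result above is a plain matrix keyed by sequence. If the counts should sit next to `fasta_header` as in the question, one possible sketch is to build a named list of count vectors and bind it to the original data. The names `aa`, `counts` and `wide` are illustrative, and `fixed()` is used on the assumption that each letter should be matched literally rather than as a regular expression:

```r
library(dplyr)
library(stringr)

# Same example data as in the question
dat <- tibble(
  fasta_header = c(">seq1", ">seq2"),
  sequence     = c("MPSRGTRPE", "VSSKYTFWNF")
)

# Unique letters, in order of first appearance across all sequences
aa <- unique(unlist(strsplit(dat$sequence, "")))

# Named list: one integer vector of per-sequence counts for each letter
counts <- sapply(aa, function(x) str_count(dat$sequence, fixed(x)),
                 simplify = FALSE)

# Attach the counts as new columns next to the original ones
wide <- bind_cols(dat, as_tibble(counts))
wide
```

Keeping the counts as a list (`simplify = FALSE`) rather than a matrix means the same code also works when `dat` has a single row.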
gaut answered Sep 19 '22 12:09