
How to count strings separated by a semicolon

Tags: string, r, count

My data looks like below:

df <- structure(list(V1 = structure(c(7L, 4L, 8L, 8L, 5L, 3L, 1L, 1L, 
2L, 1L, 6L), .Label = c("", "cell and biogenesis;transport", 
"differentiation;metabolic process;regulation;stimulus", "MAPK cascade;cell and biogenesis", 
"MAPK cascade;cell and biogenesis;transport", "metabolic process;regulation;stimulus;transport", 
"mRNA;stimulus;transport", "targeting"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA, 
-11L))

I want to count how many times each string occurs and also keep track of which rows it comes from. The strings within a row are separated by a semicolon (;), and each string belongs to the row it appears in.

I want to have the output like this:

String                           Count        position 
mRNA                                 1        1
stimulus                             3        1,6,11
transport                            4        1,5,9,11
MAPK cascade                         2        2,5
cell and biogenesis                  3        2,5,9
targeting                            2        3,4
regulation of mRNA stability         1        1
regulation                           2        6,11
differentiation                      1        6
metabolic process                    2        6,11

The count shows how many times each string (the strings are separated by semicolons) occurs in the entire data. The position column shows which rows it appeared in. For example, mRNA was only in the first row, so its position is 1; stimulus was in three rows: 1, 6, and 11.

Some rows are blank, and they still count as rows.

asked Feb 05 '23 09:02 by nik


1 Answer

In the code below we do the following:

  1. Add the row numbers as a column.
  2. Use strsplit to split each string into its components and store the result in a column called string.
  3. strsplit returns a list. We use unnest to stack the list components into a "long" (tidy) data frame that's ready to summarize.
  4. Group by string and return a new data frame that counts the frequency of each string and lists the row numbers in which it appeared.
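
As a quick illustration of steps 2 and 3, here is a minimal base R sketch of what strsplit returns. The vector x is a hypothetical three-row sample, not the question's data, and fixed = TRUE is an added assumption so that the separator is matched literally rather than as a regular expression.

```r
# Hypothetical three-row sample (not the question's data), with one blank row
x <- c("mRNA;stimulus", "", "targeting")

# strsplit() returns a list with one character vector per input row
parts <- strsplit(x, ";", fixed = TRUE)

# Number of strings found in each row; the blank row yields an empty vector
n_per_row <- lengths(parts)
n_per_row
```

Because the blank row produces an empty vector, it contributes no strings when the list is stacked, but it still occupies its position, so the row numbers of later rows are unaffected.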

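library(tidyverse)

# strsplit() needs character input, so convert the factor column first
df$V1 = as.character(df$V1)

df %>% 
  rownames_to_column() %>%                  # row numbers become a "rowname" column
  mutate(string = strsplit(V1, ";")) %>%    # list column: one character vector per row
  unnest(string) %>%                        # one row per string; blank rows drop out
  group_by(string) %>%
  summarise(count = n(),                    # how many times each string occurs
            rows = paste(rowname, collapse=","))   # rows in which it occurs

               string count     rows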
1 cell and biogenesis     3    2,5,9
2     differentiation     1        6
3        MAPK cascade     2      2,5
4   metabolic process     2     6,11
5                mRNA     1        1
6          regulation     2     6,11
7            stimulus     3   1,6,11
8           targeting     2      3,4
9           transport     4 1,5,9,11
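
For readers who would rather avoid the tidyverse dependency, the same counting can be sketched in base R. This is an alternative approach, not the answer's method: the vector V1 retypes the question's data as plain character strings, and the names parts, long, rows_by_string, and result are hypothetical.

```r
# The question's column, retyped as plain character strings (rows 7, 8, 10 are blank)
V1 <- c("mRNA;stimulus;transport",
        "MAPK cascade;cell and biogenesis",
        "targeting", "targeting",
        "MAPK cascade;cell and biogenesis;transport",
        "differentiation;metabolic process;regulation;stimulus",
        "", "",
        "cell and biogenesis;transport",
        "",
        "metabolic process;regulation;stimulus;transport")

parts <- strsplit(V1, ";", fixed = TRUE)

# Long format: repeat each row number once per string found in that row
long <- data.frame(string = unlist(parts),
                   row    = rep(seq_along(parts), lengths(parts)),
                   stringsAsFactors = FALSE)

# Group the row numbers by string
rows_by_string <- split(long$row, long$string)

result <- data.frame(string = names(rows_by_string),
                     count  = lengths(rows_by_string),
                     rows   = vapply(rows_by_string, paste, character(1),
                                     collapse = ","),
                     row.names = NULL,
                     stringsAsFactors = FALSE)
result
```

The order of the rows in result depends on how the current locale sorts the strings, so it may not match the tidyverse output line for line.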

If you plan to do further processing on the row numbers, you might want to keep them as numeric values, rather than as a string of pasted values. In that case, you could do this:

df.new = df %>% 
  rownames_to_column("rows") %>% 
  mutate(rows = as.integer(rows),           # rownames_to_column() yields character
         string = strsplit(V1, ";")) %>% 
  select(-V1) %>%
  unnest(string)

This will give you a long data frame with one row for each combination of string and rows.
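
As a rough sketch of the kind of further processing that numeric row numbers allow, the example below uses base R. It assumes the rows column holds integers (for example after as.integer()); the data frame long is a hypothetical stand-in for df.new that contains only three of the strings from the output above, and span and early are illustrative names.

```r
# Stand-in for df.new: integer row numbers for three strings from the output above
long <- data.frame(
  rows   = c(1L, 6L, 11L, 2L, 5L, 3L, 4L),
  string = c("stimulus", "stimulus", "stimulus",
             "MAPK cascade", "MAPK cascade",
             "targeting", "targeting"),
  stringsAsFactors = FALSE
)

# Arithmetic on row numbers: distance between the first and last
# row in which each string appears
span <- tapply(long$rows, long$string, function(r) max(r) - min(r))

# Numeric comparison: strings that appear anywhere in rows 1 to 5
early <- unique(long$string[long$rows <= 5])
```

Neither operation would work reliably on a pasted value such as "1,6,11" without first splitting it apart again.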

answered Feb 06 '23 23:02 by eipi10