My data looks like this:
df <- structure(list(V1 = structure(c(7L, 4L, 8L, 8L, 5L, 3L, 1L, 1L,
2L, 1L, 6L), .Label = c("", "cell and biogenesis;transport",
"differentiation;metabolic process;regulation;stimulus", "MAPK cascade;cell and biogenesis",
"MAPK cascade;cell and biogenesis;transport", "metabolic process;regulation;stimulus;transport",
"mRNA;stimulus;transport", "targeting"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA,
-11L))
I want to count how many times each string occurs, and also keep track of which rows it came from. The strings within a row are separated by a semicolon (;), and each string belongs to the row it appears in.
I want to have the output like this:
String                         Count  position
mRNA                           1      1
stimulus                       3      1,6,11
transport                      4      1,5,9,11
MAPK cascade                   2      2,5
cell and biogenesis            3      2,5,9
targeting                      2      3,4
regulation of mRNA stability   1      1
regulation                     2      6,11
differentiation                1      6,11
metabolic process              2      6,11
The count shows how many times each string (strings are separated by semicolons) occurs in the entire data. The position column shows which rows it appeared in; for example, mRNA was only in the first row, so its position is 1, while stimulus was in three rows: 1, 6 and 11.
Some rows are blank, and they still count as rows.
In the code below we do the following:
Add the row numbers as a column with rownames_to_column.
Use strsplit to split each string into its components and store the result in a column called string.
strsplit returns a list. We use unnest to stack the list components to create a "long" data frame, giving us a "tidy" data frame that's ready to summarize.
Group by string and return a new data frame that counts the frequency of each string and gives the original row number in which each instance of the string originally appeared.
library(tidyverse)
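# strsplit needs character input, so convert the factor column first
df$V1 = as.character(df$V1)
df %>%
  rownames_to_column() %>%                  # row numbers become a "rowname" column
  mutate(string = strsplit(V1, ";")) %>%    # list column of split strings
  unnest(string) %>%                        # one row per string; blank rows drop out
  group_by(string) %>%
  summarise(count = n(),
            rows = paste(rowname, collapse=","))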
  string              count rows
1 cell and biogenesis     3 2,5,9
2 differentiation         1 6
3 MAPK cascade            2 2,5
4 metabolic process       2 6,11
5 mRNA                    1 1
6 regulation              2 6,11
7 stimulus                3 1,6,11
8 targeting               2 3,4
9 transport               4 1,5,9,11
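For readers who want to see the same split, stack and group steps without the tidyverse, here is a minimal base R sketch. It is an illustration, not part of the original answer: the vector V1 is retyped by hand from the question's data, and the names parts, long, rows_by_string and result are made up for this example.

```r
# Example data, retyped from the question (rows 7, 8 and 10 are blank).
V1 <- c("mRNA;stimulus;transport",
        "MAPK cascade;cell and biogenesis",
        "targeting",
        "targeting",
        "MAPK cascade;cell and biogenesis;transport",
        "differentiation;metabolic process;regulation;stimulus",
        "", "",
        "cell and biogenesis;transport",
        "",
        "metabolic process;regulation;stimulus;transport")

# Split each row on ";". Blank rows give zero-length vectors,
# so they contribute no strings but keep their row number.
parts <- strsplit(V1, ";", fixed = TRUE)

# Long format: one line per (string, row) pair.
long <- data.frame(string = unlist(parts),
                   row    = rep(seq_along(parts), lengths(parts)),
                   stringsAsFactors = FALSE)

# Collect the row numbers for each string, then count and paste them.
rows_by_string <- split(long$row, long$string)
result <- data.frame(string = names(rows_by_string),
                     count  = lengths(rows_by_string),
                     rows   = vapply(rows_by_string, paste, character(1),
                                     collapse = ","),
                     row.names = NULL,
                     stringsAsFactors = FALSE)
result
```

The order of the strings in result depends on the locale's sort order, so look rows up by the string column rather than by position.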
If you plan to do further processing on the row numbers, you might want to keep them as numeric values, rather than as a string of pasted values. In that case, you could do this:
df.new = df %>%
  rownames_to_column("rows") %>%
  mutate(rows = as.integer(rows),           # rownames_to_column returns character
         string = strsplit(V1, ";")) %>%
  select(-V1) %>%
  unnest(string)
This will give you a long data frame with one row for each combination of string and rows.
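As one possible example of such further processing, the sketch below summarises numeric row numbers per string. It assumes a long data frame with an integer rows column and a string column; the small df.new built here is a hand-made stand-in holding only a few pairs from the example, and first_row, last_row and row_summary are names invented for the illustration.

```r
library(dplyr)

# Hypothetical stand-in for the long data frame: a few (rows, string) pairs.
df.new <- tibble::tibble(
  rows   = c(1L, 1L, 1L, 5L, 6L, 9L, 11L, 11L),
  string = c("mRNA", "stimulus", "transport", "transport",
             "stimulus", "transport", "stimulus", "transport")
)

# Because rows is numeric, min() and max() work directly, and the
# row numbers can be kept as a list column instead of a pasted string.
row_summary <- df.new %>%
  group_by(string) %>%
  summarise(count     = n(),
            first_row = min(rows),
            last_row  = max(rows),
            rows      = list(rows))
row_summary
```

Keeping rows as a list column means each string's row numbers stay available as an integer vector, for example for indexing back into the original data.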