Basically, I have a dataframe, df
Beginning1 Protein2 Protein3 Protein4 Biomarker1
Pathway3 A G NA NA F
Pathway8 Z G NA NA E
Pathway9 A G Z H F
Pathway6 Y G Z H E
Pathway2 A G D NA F
Pathway5 Q G D NA E
Pathway1 A D K NA F
Pathway7 A B C D F
Pathway4 V B C D E
And I want to combine the dataframe so that those rows when are identical from "Protein2" to "Protein4" are condense, giving the following:
Beginning1 Protein2 Protein3 Protein4 Biomarker1
Pathway3 A,Z G NA NA F,E
Pathway9 A,Y G Z H F,E
Pathway2 A,Q G D NA F,E
Pathway1 A D K NA F
Pathway7 A,V B C D F,E
This is very similar to a question that I asked before (Consolidating duplicate rows in a dataframe), however the difference is that I am also consolidating the "Beginning1" row.
So far, I have tried:
library(dat.table)
dat<-data.table(df)
Total_collapse <- dat[, .(
Biomarker1 = paste0(Biomarker1, collapse = ", ")),
by = .(Beginning1, Protein1, Protein2, Protein3)]
Total_collapse <- dat[, .(
Beginning1 = paste0(Beginning1, collapse = ", ")),
by = .(Protein1, Protein2, Protein3)]
which gives the output:
Beginning1 Protein2 Protein3 Protein4 Biomarker1
Pathway3 G NA NA F,E
Pathway9 G Z H F,E
Pathway2 G D NA F,E
Pathway1 D K NA F
Pathway7 B C D F,E
Does anyone know how to fix this problem? I have also tried duplicating the solution from Collapse / concatenate / aggregate a column to a single comma separated string within each group, but have had no success.
I am sorry if it is a simple error- I am pretty new to R.
Here's a possible solution using dplyr
df %>% group_by_at(vars(Protein2:Protein4)) %>%
summarize_all(paste, collapse=",")
Using data.table
you can use .SD
to refer to all columns not specified in the by
argument. Then we can use lapply
to accomplish the paste()
with collapse
.
library(data.table)
dt <- read.table(text = "Beginning1 Protein2 Protein3 Biomarker1
A G NA NA F
Z G NA NA E
A G Z H F
Y G Z H E
A G D NA F
Q G D NA E
A D K NA F
A B C D F
V B C D E",header = T)
dt <- data.table(dt)
dt[,lapply(.SD, function(col) paste(col, collapse=", ")),
by=.(Protein2, Protein3, Protein4)]
Output
Protein2 Protein3 Protein4 Beginning1 Biomarker1
1: G NA NA A, Z F, E
2: G Z H A, Y F, E
3: G D NA A, Q F, E
4: D K NA A F
5: B C D A, V F, E
We can use aggregate
from base R
r1 <- aggregate(cbind(Beginning1, Biomarker1)~., replace(df,is.na(df), "NA"), FUN = toString)
r1
# Protein2 Protein3 Protein4 Beginning1 Biomarker1
#1 B C D A, V F, E
#2 G Z H A, Y F, E
#3 G D NA A, Q F, E
#4 D K NA A F
#5 G NA NA A, Z F, E
r1[r1=="NA"] <- NA
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With