Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Collapse columns in a dataframe (R)

Basically, I have a dataframe, df

                  Beginning1 Protein2    Protein3    Protein4    Biomarker1
      Pathway3    A         G           NA           NA           F
      Pathway8    Z         G           NA           NA           E
      Pathway9    A         G           Z            H            F
      Pathway6    Y         G           Z            H            E
      Pathway2    A         G           D            NA           F
      Pathway5    Q         G           D            NA           E
      Pathway1    A         D           K            NA           F
      Pathway7    A         B           C            D            F
      Pathway4    V         B           C            D            E

And I want to combine the dataframe so that those rows when are identical from "Protein2" to "Protein4" are condense, giving the following:

            Beginning1 Protein2     Protein3     Protein4     Biomarker1
Pathway3    A,Z         G           NA           NA           F,E
Pathway9    A,Y         G           Z            H            F,E
Pathway2    A,Q         G           D            NA           F,E
Pathway1    A           D           K            NA           F
Pathway7    A,V         B           C            D            F,E

This is very similar to a question that I asked before (Consolidating duplicate rows in a dataframe), however the difference is that I am also consolidating the "Beginning1" row.

So far, I have tried:

library(dat.table)
dat<-data.table(df)

Total_collapse <- dat[, .(
Biomarker1 = paste0(Biomarker1, collapse = ", ")),
by = .(Beginning1, Protein1, Protein2, Protein3)]

Total_collapse <- dat[, .(
Beginning1 = paste0(Beginning1, collapse = ", ")),
by = .(Protein1, Protein2, Protein3)]

which gives the output:

            Beginning1  Protein2    Protein3      Protein4      Biomarker1
Pathway3    G           NA           NA           F,E
Pathway9    G           Z            H            F,E
Pathway2    G           D            NA           F,E
Pathway1    D           K            NA           F
Pathway7    B           C            D            F,E

Does anyone know how to fix this problem? I have also tried duplicating the solution from Collapse / concatenate / aggregate a column to a single comma separated string within each group, but have had no success.

I am sorry if it is a simple error- I am pretty new to R.

like image 999
Taylor Maurer Avatar asked Jul 25 '17 17:07

Taylor Maurer


3 Answers

Here's a possible solution using dplyr

df %>% group_by_at(vars(Protein2:Protein4)) %>%
  summarize_all(paste, collapse=",")
like image 185
MrFlick Avatar answered Oct 21 '22 17:10

MrFlick


Using data.table you can use .SD to refer to all columns not specified in the by argument. Then we can use lapply to accomplish the paste() with collapse.

library(data.table)
dt <- read.table(text = "Beginning1 Protein2    Protein3    Biomarker1
                  A         G           NA           NA           F
                  Z         G           NA           NA           E
                  A         G           Z            H            F
                  Y         G           Z            H            E
                  A         G           D            NA           F
                  Q         G           D            NA           E
                  A         D           K            NA           F
                  A         B           C            D            F
                  V         B           C            D            E",header = T)
dt <- data.table(dt)
dt[,lapply(.SD, function(col) paste(col, collapse=", ")), 
    by=.(Protein2, Protein3, Protein4)]

Output

   Protein2 Protein3 Protein4 Beginning1 Biomarker1
1:        G       NA       NA       A, Z       F, E
2:        G        Z        H       A, Y       F, E
3:        G        D       NA       A, Q       F, E
4:        D        K       NA          A          F
5:        B        C        D       A, V       F, E
like image 32
Adam Spannbauer Avatar answered Oct 21 '22 16:10

Adam Spannbauer


We can use aggregate from base R

r1 <- aggregate(cbind(Beginning1, Biomarker1)~., replace(df,is.na(df), "NA"), FUN = toString)
r1
#    Protein2 Protein3 Protein4 Beginning1 Biomarker1
#1        B        C        D       A, V       F, E
#2        G        Z        H       A, Y       F, E
#3        G        D       NA       A, Q       F, E
#4        D        K       NA          A          F
#5        G       NA       NA       A, Z       F, E
r1[r1=="NA"] <- NA
like image 31
akrun Avatar answered Oct 21 '22 17:10

akrun