Consolidate duplicate columns and concatenate column names

Question

I'm sure this must be a duplicate. Say I have this:

stage <- seq(1,3,1)
exp1 <- c("A","B","C")
exp2 <- c("A","B","C")
exp3 <- c(NA, "B","C")
exp4 <- c("D","B","C")
exp5 <- c("A","B","C")
exp6 <- c(NA, "B","C")

df <- data.frame(stage, exp1, exp2, exp3, exp4, exp5, exp6)

  stage exp1 exp2 exp3 exp4 exp5 exp6
1     1    A    A <NA>    D    A <NA>
2     2    B    B    B    B    B    B
3     3    C    C    C    C    C    C

I want to combine all duplicated columns and show this by concatenating the column names. I can find duplicated columns with:

df[duplicated(lapply(df, summary))]

exp2 exp5 exp6
1    A    A <NA>
2    B    B    B
3    C    C    C

But I can't figure out how to combine the duplicated column names such that I get something like this:

  stage exp1_exp2_exp5 exp3_exp6 exp4
1     1              A      <NA>    D
2     2              B         B    B
3     3              C         C    C

Perhaps (probably) this is a poor way to visualize a comparison between different "exp"s? Maybe I need to reshape to show this more clearly?

lmo · Accepted Answer

With base R, you can use match on the unique set of vectors in the data.frame to get the groupings, feed this to split, with the names of the variables as the first argument to break up the names by group, use sapplyand paste to concatenate the variable names in each group, and then use setNames to provide the names to the unique set of columns.

setNames(as.data.frame(unique(as.list(df))),
         sapply(split(names(df), match(as.list(df), unique(as.list(df)))),
                paste, collapse="-"))

which returns

  stage exp1-exp2-exp5 exp3-exp6 exp4
1     1              A      <NA>    D
2     2              B         B    B
3     3              C         C    C

Jaap · Answer

Looking for something like this?

library(dplyr)
library(tidyr)
df %>% 
  gather(variable, value, -1) %>% 
  group_by(variable) %>% 
  summarise(values = paste(sort(value), collapse = ',')) %>% 
  group_by(values) %>% 
  summarise(cols = paste(variable, collapse = '_')) %>% 
  separate_rows(values) %>% 
  left_join(df %>% gather(variable, value, -1, na.rm = TRUE), ., by = c('value'='values')) %>% 
  select(-variable) %>% 
  distinct() %>% 
  spread(cols, value)

which gives:

  stage exp1_exp2_exp5 exp3_exp6 exp4
1     1              A      <NA>    D
2     2              B         B    B
3     3              C         C    C

Consolidate duplicate columns and concatenate column names

Tags:

dataframe

r

duplicates

Pete900

2 Answers

lmo

Jaap

Recent Activity

Donate For Us

Consolidate duplicate columns and concatenate column names

Tags:

dataframe

r

duplicates

Pete900

2 Answers

lmo

Jaap

Related questions

Recent Activity

Donate For Us