Combining (pasting) columns

Question

I have the following data.frame

Tipo Start  End Strand Accesion1 Accesion2
1 gene   197 1558      +      <NA>   SP_0001
2  CDS   197 1558      + NP_344554      <NA>
3 gene  1717 2853      +      <NA>   SP_0002
4  CDS  1717 2853      + NP_344555      <NA>
5 gene  2864 3112      +      <NA>   SP_0003
6  CDS  2864 3112      + NP_344556      <NA>

There are more "Tipo" values, such as tRNA, region, exon, or rRNA, but I am only interested in combining these two, gene and CDS.

And I would like to get the following

Start End Accesion1 Accesion2
1 197 1558 NP_344554 SP_0001

but only when the Start and End values of gene and CDS coincide. I've tried to use select, arrange and mutate with dplyr, but I find it complicated to get rid of the NAs.

BrodieG · Accepted Answer

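A dplyr version with summarise_each:

library(dplyr)
DF %>% 
  group_by(Start, End) %>% 
  # na.rm = TRUE so that real NA values are skipped when taking the max
  summarise_each(funs(max(., na.rm = TRUE)), Accesion1, Accesion2)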

Produces:

Source: local data frame [3 x 4]
Groups: Start

  Start  End Accesion1 Accesion2
1   197 1558 NP_344554   SP_0001
2  1717 2853 NP_344555   SP_0002
3  2864 3112 NP_344556   SP_0003

This assumes the AccesionX variables are character (it does not work with factors), and that each Start/End pair contains exactly two rows, one gene and one CDS, as in your data set.
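For anyone on a more recent dplyr, here is a rough sketch of the same idea written with across(). I am assuming dplyr 1.0 or later (where across() was introduced), character accession columns, and real NA values rather than the string "<NA>". The data frame is rebuilt from the question so the snippet runs on its own, and the name res is just mine.

```r
library(dplyr)

# Example data from the question, with character columns (assumption)
DF <- data.frame(
  Tipo = c("gene", "CDS", "gene", "CDS", "gene", "CDS"),
  Start = c(197, 197, 1717, 1717, 2864, 2864),
  End = c(1558, 1558, 2853, 2853, 3112, 3112),
  Strand = "+",
  Accesion1 = c(NA, "NP_344554", NA, "NP_344555", NA, "NP_344556"),
  Accesion2 = c("SP_0001", NA, "SP_0002", NA, "SP_0003", NA),
  stringsAsFactors = FALSE
)

res <- DF %>%
  filter(Tipo %in% c("gene", "CDS")) %>%   # ignore tRNA, exon, etc.
  group_by(Start, End) %>%
  # take the non-NA value in each group by dropping NAs before max()
  summarise(across(c(Accesion1, Accesion2), ~ max(.x, na.rm = TRUE)),
            .groups = "drop")
res
```

The same caveat applies: this only makes sense if each Start/End group holds one gene row and one CDS row.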

akrun · Answer

You could try

library(data.table)
# id numbers each gene row together with the rows that follow it
setDT(df1)[, id := cumsum(Tipo == 'gene')][,
   list(Accesion1 = na.omit(Accesion1), Accesion2 = na.omit(Accesion2)),
   by = list(id, Start, End)]
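In case the id step is not obvious, here is a small base R illustration of what cumsum(Tipo == 'gene') produces. The Tipo vector below is made up for the illustration (it includes a tRNA row that is not in the example data).

```r
# Hypothetical sequence of row types, including one non-gene, non-CDS row
Tipo <- c("gene", "CDS", "gene", "CDS", "tRNA", "gene", "CDS")

# The counter increases by one at every gene row, so each gene row and
# the rows after it (up to the next gene row) share the same id
id <- cumsum(Tipo == "gene")
id
# 1 1 2 2 2 3 3
```

Note that the tRNA row ends up with the id of the preceding gene, so rows of other types probably need to be filtered out first if the full data contains them.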

bgoldst · Answer

Here's a solution using aggregate():

df <- data.frame(
  Tipo = c('gene','CDS','gene','CDS','gene','CDS'),
  Start = c(197,197,1717,1717,2864,2864),
  End = c(1558,1558,2853,2853,3112,3112),
  Strand = c('+','+','+','+','+','+'),
  Accesion1 = c(NA,'NP_344554',NA,'NP_344555',NA,'NP_344556'),
  Accesion2 = c('SP_0001',NA,'SP_0002',NA,'SP_0003',NA)
)
# keep only the gene and CDS rows and the columns of interest
df2 <- df[df$Tipo %in% c('gene','CDS'), c('Start','End','Accesion1','Accesion2')]
# within each Start/End group, keep the non-NA value of each accession column
aggregate(df2[, c('Accesion1','Accesion2')], df2[, c('Start','End')], function(x) x[!is.na(x)])
##   Start  End Accesion1 Accesion2
## 1   197 1558 NP_344554   SP_0001
## 2  1717 2853 NP_344555   SP_0002
## 3  2864 3112 NP_344556   SP_0003

Precomputing df2 is necessary in case there are non-gene non-CDS rows in the original data.frame; in order to properly aggregate just the gene and CDS rows, the non-gene non-CDS rows must be excluded from both x and by. (Of course, your example data has only gene and CDS rows, so it's not technically necessary for the example data.)

This solution makes the assumption that whenever two rows have the same Start and End values, then they must be gene/CDS pairs (as opposed to gene/gene or CDS/CDS).
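If you are not sure that assumption holds for your full data, a check along these lines might help before aggregating. This is only a sketch: the mock data (where the last group has two gene rows) and the names mock_df, key, counts and bad_keys are mine, not part of the original data.

```r
# Mock data: the last Start/End group has two gene rows and no CDS row
mock_df <- data.frame(
  Tipo  = c("gene", "CDS", "gene", "CDS", "gene", "gene"),
  Start = c(197, 197, 1717, 1717, 2864, 2864),
  End   = c(1558, 1558, 2853, 2853, 3112, 3112),
  stringsAsFactors = FALSE
)

# keep only gene and CDS rows, then build one key per Start/End pair
df2 <- mock_df[mock_df$Tipo %in% c("gene", "CDS"), ]
key <- paste(df2$Start, df2$End, sep = "-")

# count gene and CDS rows per key
counts <- table(key, factor(df2$Tipo, levels = c("gene", "CDS")))

# a proper pair has exactly one gene row and one CDS row
ok <- counts[, "gene"] == 1 & counts[, "CDS"] == 1
bad_keys <- names(ok)[!ok]
bad_keys
# "2864-3112"
```

Any key listed in bad_keys is a Start/End group that is not a clean gene/CDS pair and would need to be dropped or inspected by hand.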

jazzurro · Answer

Here is one potential way. You choose the rows with gene and CDS. Then you group your data by Start and End. There may be Start/End groups with 1 or 3+ rows, so you want to make sure that you choose only Start/End groups with two rows. In addition, you want to make sure that each group has both gene and CDS (length(unique(Tipo)) == 2). Finally, you take the non-NA element in Accesion1 and Accesion2.

library(dplyr)

filter(df, Tipo %in% c("gene", "CDS")) %>%
  group_by(Start, End) %>%
  filter(n() == 2 & length(unique(Tipo)) == 2) %>%
  summarise(Accesion1 = Accesion1[!is.na(Accesion1)],
            Accesion2 = Accesion2[!is.na(Accesion2)])

Here is a mock example, in which the last Start/End group has two gene rows and no CDS row.

mydf <- structure(list(Tipo = structure(c(2L, 1L, 2L, 1L, 2L, 2L), .Label = c("CDS", 
"gene"), class = "factor"), Start = c(197, 197, 1717, 1717, 2864, 
2864), End = c(1558, 1558, 2853, 2853, 3112, 3112), Strand = structure(c(1L, 
1L, 1L, 1L, 1L, 1L), .Label = "+", class = "factor"), Accesion1 = structure(c(NA, 
1L, NA, 2L, NA, 3L), .Label = c("NP_344554", "NP_344555", "NP_344556"
), class = "factor"), Accesion2 = structure(c(1L, NA, 2L, NA, 
3L, NA), .Label = c("SP_0001", "SP_0002", "SP_0003"), class = "factor")), .Names = c("Tipo", 
"Start", "End", "Strand", "Accesion1", "Accesion2"), row.names = c(NA, 
-6L), class = "data.frame")


  Tipo Start  End Strand Accesion1 Accesion2
1 gene   197 1558      +      <NA>   SP_0001
2  CDS   197 1558      + NP_344554      <NA>
3 gene  1717 2853      +      <NA>   SP_0002
4  CDS  1717 2853      + NP_344555      <NA>
5 gene  2864 3112      +      <NA>   SP_0003
6 gene  2864 3112      + NP_344556      <NA>


filter(mydf, Tipo %in% c("gene", "CDS")) %>%
  group_by(Start, End) %>%
  filter(n() == 2 & length(unique(Tipo)) == 2) %>%
  summarise(Accesion1 = Accesion1[!is.na(Accesion1)],
            Accesion2 = Accesion2[!is.na(Accesion2)])

#  Start  End Accesion1 Accesion2
#1   197 1558 NP_344554   SP_0001
#2  1717 2853 NP_344555   SP_0002

Tags: r, dplyr

Antonio Rodriguez Franco
