
Combining (pasting) columns

Tags: r, dplyr

I have the following data.frame:

Tipo Start  End Strand Accesion1 Accesion2
1 gene   197 1558      +      <NA>   SP_0001
2  CDS   197 1558      + NP_344554      <NA>
3 gene  1717 2853      +      <NA>   SP_0002
4  CDS  1717 2853      + NP_344555      <NA>
5 gene  2864 3112      +      <NA>   SP_0003
6  CDS  2864 3112      + NP_344556      <NA>

There are other "Tipo" values, such as tRNA, region, exon, or rRNA, but I am only interested in combining these two: gene and CDS.

I would like to get the following:

Start End Accesion1 Accesion2
1 197 1558 NP_344554 SP_0001

but only when the Start and End values of gene and CDS coincide. I've tried using select, arrange and mutate with dplyr, but I am finding it complicated to get rid of the NAs.

Antonio Rodriguez Franco asked Apr 20 '15 15:04


4 Answers

A dplyr version with summarize_each:

DF %>% 
  group_by(Start, End) %>% 
  summarise_each(funs(max), Accesion1, Accesion2)

Produces:

Source: local data frame [3 x 4]
Groups: Start

  Start  End Accesion1 Accesion2
1   197 1558 NP_344554   SP_0001
2  1717 2853 NP_344555   SP_0002
3  2864 3112 NP_344556   SP_0003

This assumes the AccesionX variables are character (it does not work with factors), and that each Start/End pair contains exactly two rows, one gene and one CDS, as in your data set.
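
In current dplyr releases, summarise_each() and funs() are deprecated. A minimal sketch of the same idea with across() (dplyr 1.0.0 or later) could look like the following. The data frame is rebuilt here from the question's table, and na.rm = TRUE is added on the assumption that the missing values are genuine NA rather than the literal string "<NA>", since max() returns NA for a character vector that contains NA.

```r
library(dplyr)

# Example data rebuilt from the question; missing values are real NA
# and the accession columns are character, not factor.
DF <- data.frame(
  Tipo = rep(c("gene", "CDS"), times = 3),
  Start = c(197, 197, 1717, 1717, 2864, 2864),
  End = c(1558, 1558, 2853, 2853, 3112, 3112),
  Strand = "+",
  Accesion1 = c(NA, "NP_344554", NA, "NP_344555", NA, "NP_344556"),
  Accesion2 = c("SP_0001", NA, "SP_0002", NA, "SP_0003", NA),
  stringsAsFactors = FALSE
)

# For each Start/End pair, keep the one non-missing value per accession column.
res <- DF %>%
  group_by(Start, End) %>%
  summarise(
    across(c(Accesion1, Accesion2), ~ max(.x, na.rm = TRUE)),
    .groups = "drop"
  )

print(res)
```

The same assumptions as above still apply: character columns, and one gene row plus one CDS row per Start/End pair.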

BrodieG answered Oct 19 '22 19:10


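You could try:

library(data.table)
# Give each gene row and the rows that follow it a shared id, then collapse
# every id/Start/End group to its non-NA accession values.
setDT(df1)[, id := cumsum(Tipo == 'gene')][,
  list(Accesion1 = na.omit(Accesion1), Accesion2 = na.omit(Accesion2)),
  by = list(id, Start, End)]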

akrun answered Oct 19 '22 20:10


Here's a solution using aggregate():

df <- data.frame(
  Tipo = c('gene', 'CDS', 'gene', 'CDS', 'gene', 'CDS'),
  Start = c(197, 197, 1717, 1717, 2864, 2864),
  End = c(1558, 1558, 2853, 2853, 3112, 3112),
  Strand = c('+', '+', '+', '+', '+', '+'),
  Accesion1 = c(NA, 'NP_344554', NA, 'NP_344555', NA, 'NP_344556'),
  Accesion2 = c('SP_0001', NA, 'SP_0002', NA, 'SP_0003', NA)
)
df2 <- df[df$Tipo %in% c('gene', 'CDS'), c('Start', 'End', 'Accesion1', 'Accesion2')]
aggregate(df2[, c('Accesion1', 'Accesion2')], df2[, c('Start', 'End')], function(x) x[!is.na(x)])
##   Start  End Accesion1 Accesion2
## 1   197 1558 NP_344554   SP_0001
## 2  1717 2853 NP_344555   SP_0002
## 3  2864 3112 NP_344556   SP_0003

Precomputing df2 is necessary in case there are non-gene non-CDS rows in the original data.frame; in order to properly aggregate just the gene and CDS rows, the non-gene non-CDS rows must be excluded from both x and by. (Of course, your example data has only gene and CDS rows, so it's not technically necessary for the example data.)

This solution makes the assumption that whenever two rows have the same Start and End values, then they must be gene/CDS pairs (as opposed to gene/gene or CDS/CDS).
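
If you are not sure that assumption holds for your full data, one way to check it before aggregating is to list which Tipo values share each Start/End pair. This is only a sketch in base R; the names check, Tipos and bad are made up for illustration, and the data frame is the example from above rebuilt with character columns.

```r
# Example data from above, with character columns.
df <- data.frame(
  Tipo = c("gene", "CDS", "gene", "CDS", "gene", "CDS"),
  Start = c(197, 197, 1717, 1717, 2864, 2864),
  End = c(1558, 1558, 2853, 2853, 3112, 3112),
  Accesion1 = c(NA, "NP_344554", NA, "NP_344555", NA, "NP_344556"),
  Accesion2 = c("SP_0001", NA, "SP_0002", NA, "SP_0003", NA),
  stringsAsFactors = FALSE
)
df2 <- df[df$Tipo %in% c("gene", "CDS"), ]

# Concatenate the Tipo values found in each Start/End group.
check <- aggregate(
  Tipo ~ Start + End,
  data = df2,
  FUN = function(x) paste(sort(as.character(x)), collapse = "/")
)
names(check)[3] <- "Tipos"

# Any group that is not exactly one CDS plus one gene breaks the assumption.
bad <- check[check$Tipos != "CDS/gene", ]
print(check)
print(nrow(bad))
```

If bad has any rows, those Start/End pairs (for example gene/gene or a lone CDS) need to be handled or dropped before calling aggregate().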

bgoldst answered Oct 19 '22 21:10


Here is one potential way. First, keep only the rows where Tipo is gene or CDS. Then group the data by Start and End. Some Start/End groups may have one row, or three or more, so keep only the groups with exactly two rows (n() == 2). You also want to make sure each group contains both a gene and a CDS (length(unique(Tipo)) == 2). Finally, take the non-NA element of Accesion1 and of Accesion2.

library(dplyr)

filter(df, Tipo %in% c("gene", "CDS")) %>%
  group_by(Start, End) %>%
  filter(n() == 2 & length(unique(Tipo)) == 2) %>%
  summarise(Accesion1 = Accesion1[!is.na(Accesion1)],
            Accesion2 = Accesion2[!is.na(Accesion2)])

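Here is a modified example. Row 6 is changed from CDS to gene, so the third Start/End group contains two gene rows and should be dropped.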

mydf <- structure(list(Tipo = structure(c(2L, 1L, 2L, 1L, 2L, 2L), .Label = c("CDS", 
"gene"), class = "factor"), Start = c(197, 197, 1717, 1717, 2864, 
2864), End = c(1558, 1558, 2853, 2853, 3112, 3112), Strand = structure(c(1L, 
1L, 1L, 1L, 1L, 1L), .Label = "+", class = "factor"), Accesion1 = structure(c(NA, 
1L, NA, 2L, NA, 3L), .Label = c("NP_344554", "NP_344555", "NP_344556"
), class = "factor"), Accesion2 = structure(c(1L, NA, 2L, NA, 
3L, NA), .Label = c("SP_0001", "SP_0002", "SP_0003"), class = "factor")), .Names = c("Tipo", 
"Start", "End", "Strand", "Accesion1", "Accesion2"), row.names = c(NA, 
-6L), class = "data.frame")


  Tipo Start  End Strand Accesion1 Accesion2
1 gene   197 1558      +      <NA>   SP_0001
2  CDS   197 1558      + NP_344554      <NA>
3 gene  1717 2853      +      <NA>   SP_0002
4  CDS  1717 2853      + NP_344555      <NA>
5 gene  2864 3112      +      <NA>   SP_0003
6 gene  2864 3112      + NP_344556      <NA>


filter(mydf, Tipo %in% c("gene", "CDS")) %>%
group_by(Start, End) %>%
filter(n() == 2 & length(unique(Tipo)) == 2) %>%
summarise(Accesion1 = Accesion1[!is.na(Accesion1)],
          Accesion2 = Accesion2[!is.na(Accesion2)])

#  Start  End Accesion1 Accesion2
#1   197 1558 NP_344554   SP_0001
#2  1717 2853 NP_344555   SP_0002
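
The pairing rule can also be written as a join between the gene rows and the CDS rows. This is a different technique from the grouped filter above, sketched in base R with merge(); the names genes, cds and paired are made up for illustration, and the data mirrors the modified example with character columns. Note one difference: a Start/End group with, say, two gene rows and one CDS row would still produce output here, whereas the n() == 2 filter drops it.

```r
# Modified example data: row 6 is a gene, not a CDS.
mydf <- data.frame(
  Tipo = c("gene", "CDS", "gene", "CDS", "gene", "gene"),
  Start = c(197, 197, 1717, 1717, 2864, 2864),
  End = c(1558, 1558, 2853, 2853, 3112, 3112),
  Strand = "+",
  Accesion1 = c(NA, "NP_344554", NA, "NP_344555", NA, "NP_344556"),
  Accesion2 = c("SP_0001", NA, "SP_0002", NA, "SP_0003", NA),
  stringsAsFactors = FALSE
)

# Split by Tipo, keeping only the accession column each type carries here.
genes <- mydf[mydf$Tipo == "gene", c("Start", "End", "Accesion2")]
cds <- mydf[mydf$Tipo == "CDS", c("Start", "End", "Accesion1")]

# Inner join: only Start/End pairs present in both tables survive.
paired <- merge(cds, genes, by = c("Start", "End"))
print(paired)
```

On this data the join keeps the two Start/End pairs that have both a gene and a CDS row and drops the gene-only group.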

jazzurro answered Oct 19 '22 20:10