Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: combine same identifiers in dataframe

Tags:

dataframe

r

I have a dataframe with 2 columns, one Identifier and column with names. Each Identifier is several times present in the column ID (see below).

 ID           Names
uc001aag.1  DKFZp686C24272
uc001aag.1  DQ786314
uc001aag.1  uc001aag.1
uc001aah.2  AK056232
uc001aah.2  FLJ00038
uc001aah.2  uc001aah.1
uc001aah.2  uc001aah.2
uc001aai.1  AY217347

Now I want to create a dataframe like this:

 ID           Names
uc001aag.1  DKFZp686C24272 | DQ786314 | uc001aag.1
uc001aah.2  AK056232 | FLJ00038 | uc001aah.1 | uc001aah.2
uc001aai.1  AY217347

Can anyone help me?

like image 986
Lisann Avatar asked Feb 25 '23 06:02

Lisann


2 Answers

Aggregate is quite a fast one, but you can use an sapply solution to parallelize the code. This can easily be done on Windows using snowfall :

require(snowfall)
sfInit(parallel=TRUE,cpus=2)
sfExport("Data")

ID <- unique(Data$ID)
CombNames <- sfSapply(ID,function(i){
    paste(Data$Names[Data$ID==i],collapse=" | ")
})
data.frame(ID,CombNames)
sfStop()

The parallel version will give you an extra speedup, but the single sapply solution is actually slower than aggregate. Tapply is a bit faster, but can't be parallelized using snowfall. on my computer :

n <- 3000
m <- 3
Data <- data.frame( ID = rep(1:n,m),
                    Names=rep(LETTERS[1:m],each=n))
 # using snowfall for parallel sapply    
 system.time({
   ID <- unique(Data$ID)
   CombNames <- sfSapply(ID,function(i){
     paste(Data$Names[Data$ID==i],collapse=" | ")
   })
   data.frame(ID,CombNames)
 }) 
   user  system elapsed 
   0.02    0.00    0.33 

 # using tapply
 system.time({
   CombNames <- tapply(Data$Names,Data$ID,paste,collapse=" | ")
   data.frame(ID=names(CombNames),CombNames)
 })
   user  system elapsed 
   0.44    0.00    0.44 

 # using aggregate
 system.time(
   aggregate(Names ~ ID, data=Data, FUN=paste, collapse=" | ")
 )
   user  system elapsed 
   0.47    0.00    0.47 

 # using the normal sapply
 system.time({
   ID <- unique(Data$ID)
   CombNames <- sapply(ID,function(i){
     paste(Data$Names[Data$ID==i],collapse=" | ")
   })
   data.frame(ID,CombNames)
 })
   user  system elapsed 
   0.75    0.00    0.75 

Note:

For the record, the better sapply-solution would be :

CombNames <- sapply(split(Data$Names,Data$ID),paste,collapse=" | ")
data.frame(ID=names(CombNames),CombNames)

which is equivalent to tapply. But parallelizing this one is actually slower, as you have to move more data around within the sfSapply. The speed comes from copying the dataset to every cpu. This is what you have to keep in mind when your dataset is huge : you'll pay the speed with more memory usage.

like image 114
Joris Meys Avatar answered Feb 26 '23 18:02

Joris Meys


You can use aggregate:

R> aggregate(Names ~ ID, data=tmp, FUN=paste, collapse=" | ")
          ID                                         Names
1 uc001aag.1        DKFZp686C24272 | DQ786314 | uc001aag.1
2 uc001aah.2 AK056232 | FLJ00038 | uc001aah.1 | uc001aah.2
3 uc001aai.1                                      AY217347
like image 38
rcs Avatar answered Feb 26 '23 19:02

rcs