
Transpose duplicated rows to column in R

I have a large data.frame (20000+ entries) in this format:

id  D1      D2
1   0.40    0.21
1   0.00    0.00
1   0.53    0.20
2   0.17    0.17
2   0.25    0.25
2   0.55    0.43

Each id may be duplicated 3-20 times. I would like to merge the duplicated rows into new columns, so that my new data.frame looks like:

id  D1      D2      D3      D4      D5      D6
1   0.40    0.21    0.00    0.00    0.53    0.20
2   0.17    0.17    0.25    0.25    0.55    0.43

I've manipulated data.frames before with plyr, but I'm not sure how to approach this problem. Any help would be appreciated. Thanks.

user2624830 asked Jul 27 '13 04:07




1 Answer

The best option would be to just use melt and dcast from "reshape2". But before we jump to that option, let's see what else we have available to us:


You mention that the number of rows per "id" is unbalanced. That would make it somewhat difficult to put into a tidy rectangular data.frame.

Here are a few examples.

Balanced data: Three rows per "id"

mydf <- structure(list(id = c(1, 1, 1, 2, 2, 2), 
                       D1 = c(0.4, 0, 0.53, 0.17, 0.25, 0.55), 
                       D2 = c(0.21, 0, 0.2, 0.17, 0.25, 0.43)), 
                  .Names = c("id", "D1", "D2"), row.names = c(NA, 6L), 
                  class = "data.frame")
mydf
#   id   D1   D2
# 1  1 0.40 0.21
# 2  1 0.00 0.00
# 3  1 0.53 0.20
# 4  2 0.17 0.17
# 5  2 0.25 0.25
# 6  2 0.55 0.43

With such data, you can just use aggregate:

do.call(data.frame, aggregate(. ~ id, mydf, as.vector))
#   id D1.1 D1.2 D1.3 D2.1 D2.2 D2.3
# 1  1 0.40 0.00 0.53 0.21 0.00 0.20
# 2  2 0.17 0.25 0.55 0.17 0.25 0.43
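
Whether this approach applies depends on every "id" having the same number of rows. A quick way to check, as a minimal base R sketch (the names `counts` and `is_balanced` are illustrative and not part of the original answer):

```r
# Same balanced sample data as above.
mydf <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                   D1 = c(0.40, 0.00, 0.53, 0.17, 0.25, 0.55),
                   D2 = c(0.21, 0.00, 0.20, 0.17, 0.25, 0.43))

# Number of rows per id.
counts <- table(mydf$id)

# TRUE only when every id has the same number of rows.
is_balanced <- length(unique(as.vector(counts))) == 1
is_balanced
```

If `is_balanced` is FALSE, one of the approaches for unbalanced data is needed instead.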

Unbalanced data: Some workarounds

If you add a fourth row for "id = 2", the groups no longer have the same length and aggregate fails:

mydf[7, ] <- c(2, .44, .33)
do.call(data.frame, aggregate(. ~ id, mydf, as.vector))
# Error in data.frame(`0` = c(0.4, 0, 0.53), `1` = c(0.17, 0.25, 0.55, 0.44 : 
#   arguments imply differing number of rows: 3, 4

It might be best to just have a list of the resulting vectors:

lapply(split(mydf[-1], mydf[[1]]), function(x) unlist(x, use.names=FALSE))
# $`1`
# [1] 0.40 0.00 0.53 0.21 0.00 0.20
# 
# $`2`
# [1] 0.17 0.25 0.55 0.44 0.17 0.25 0.43 0.33
# 

Or, if you insist on a rectangular data.frame, explore one of the several tools to rbind unbalanced data, for example, rbind.fill from "plyr":

library(plyr)
rbind.fill(lapply(split(mydf[-1], mydf[[1]]), 
                  function(x) data.frame(t(unlist(x, use.names=FALSE)))))
#     X1   X2   X3   X4   X5   X6   X7   X8
# 1 0.40 0.00 0.53 0.21 0.00 0.20   NA   NA
# 2 0.17 0.25 0.55 0.44 0.17 0.25 0.43 0.33
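
If you would rather not depend on "plyr", the same padding can probably be done in base R. The following is only a sketch under the assumption that all value columns are numeric; `vals`, `n`, and `padded` are illustrative names. It relies on the fact that extending a vector with `length<-` fills the new positions with NA.

```r
# Unbalanced sample data: id 1 has three rows, id 2 has four.
mydf <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2),
                   D1 = c(0.40, 0.00, 0.53, 0.17, 0.25, 0.55, 0.44),
                   D2 = c(0.21, 0.00, 0.20, 0.17, 0.25, 0.43, 0.33))

# One flat vector per id (all D1 values first, then all D2 values).
vals <- lapply(split(mydf[-1], mydf$id), unlist, use.names = FALSE)

# Pad every vector to the length of the longest one; `length<-` adds NA.
n <- max(lengths(vals))
padded <- t(vapply(vals, function(v) { length(v) <- n; v }, numeric(n)))

# A matrix with one row per id; the row names are the id values.
padded
```

The result is a matrix rather than a data.frame; wrap it in `as.data.frame()` if you need one.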

Unbalanced data: A more direct approach

Alternatively, you can use melt and dcast from "reshape2" as follows:

library(reshape2)
x <- melt(mydf, id.vars = "id")
## ^^ That's not enough information for `dcast`
##    We need a "time" variable too, so use `ave`
##      to create one according to the number of
##      values per ID.
x$time <- ave(x$id, x$id, FUN = seq_along)
## ^^ I would probably actually stop at this point.
##    Long data with proper ID and "time" values
##      tend to be easier to work with and many
##      other functions in R work more nicely with
##      this long data format.
dcast(x, id ~ time, value.var = "value")
#   id    1    2    3    4    5    6    7    8
# 1  1 0.40 0.00 0.53 0.21 0.00 0.20   NA   NA
# 2  2 0.17 0.25 0.55 0.44 0.17 0.25 0.43 0.33
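
Note that the output above lists all D1 values before the D2 values, whereas the question asks for the values in row order (D1, D2 of the first row, then D1, D2 of the second row, and so on). One possible way to get that ordering is base R's `reshape()`; this is a sketch rather than part of the original answer, and the `time` column is a helper added for illustration.

```r
# Unbalanced sample data: id 1 has three rows, id 2 has four.
mydf <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2),
                   D1 = c(0.40, 0.00, 0.53, 0.17, 0.25, 0.55, 0.44),
                   D2 = c(0.21, 0.00, 0.20, 0.17, 0.25, 0.43, 0.33))

# Within-id row counter (1, 2, 3, ...), which plays the role of "time".
# Using seq_along() as the first argument keeps this working for
# non-numeric ids as well.
mydf$time <- ave(seq_along(mydf$id), mydf$id, FUN = seq_along)

# One row per id; every (variable, time) pair becomes its own column.
# Ids with fewer rows get NA in the extra columns.
wide <- reshape(mydf, idvar = "id", timevar = "time", direction = "wide")
wide
```

The columns come out as D1.1, D2.1, D1.2, D2.2, and so on, so each original row's values stay next to each other.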
A5C1D2H2I1M1N2O1R2T1 answered Oct 14 '22 00:10