I have a large data.frame (20000+ entries) in this format:
id D1 D2
1 0.40 0.21
1 0.00 0.00
1 0.53 0.20
2 0.17 0.17
2 0.25 0.25
2 0.55 0.43
Each id may be duplicated 3-20 times. I would like to merge the duplicated rows into new columns, so that my new data.frame looks like:
id D1 D2 D3 D4 D5 D6
1 0.40 0.21 0.00 0.00 0.53 0.20
2 0.17 0.17 0.25 0.25 0.55 0.43
I've manipulated data.frames before with plyr, but I'm not sure how to approach this problem. Any help would be appreciated. Thanks.
The best option would be to just use melt and dcast from "reshape2". But before we jump to that option, let's see what else we have available to us:
You mention that the number of rows per "id" is unbalanced. That would make it somewhat difficult to put into a tidy rectangular data.frame.
Here are a few examples.
mydf <- structure(list(id = c(1, 1, 1, 2, 2, 2),
D1 = c(0.4, 0, 0.53, 0.17, 0.25, 0.55),
D2 = c(0.21, 0, 0.2, 0.17, 0.25, 0.43)),
.Names = c("id", "D1", "D2"), row.names = c(NA, 6L),
class = "data.frame")
mydf
# id D1 D2
# 1 1 0.40 0.21
# 2 1 0.00 0.00
# 3 1 0.53 0.20
# 4 2 0.17 0.17
# 5 2 0.25 0.25
# 6 2 0.55 0.43
With such data, where every "id" has the same number of rows, you can just use aggregate:
do.call(data.frame, aggregate(. ~ id, mydf, as.vector))
# id D1.1 D1.2 D1.3 D2.1 D2.2 D2.3
# 1 1 0.40 0.00 0.53 0.21 0.00 0.20
# 2 2 0.17 0.25 0.55 0.17 0.25 0.43
If you add a fourth row for "id = 2", the groups are no longer the same size and aggregate stops working:
mydf[7, ] <- c(2, .44, .33)
do.call(data.frame, aggregate(. ~ id, mydf, as.vector))
# Error in data.frame(`0` = c(0.4, 0, 0.53), `1` = c(0.17, 0.25, 0.55, 0.44 :
# arguments imply differing number of rows: 3, 4
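One possible workaround, offered here only as a sketch and not as part of the approach above, is to pad every group to the size of the largest group before aggregate simplifies the result. It relies on the fact that indexing a vector past its end returns NA. The names n and padded are made up for this illustration, and mydf is rebuilt so the block runs on its own:

```r
# Rebuild the unbalanced example data (id 2 has four rows, id 1 has three).
mydf <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2),
                   D1 = c(0.40, 0.00, 0.53, 0.17, 0.25, 0.55, 0.44),
                   D2 = c(0.21, 0.00, 0.20, 0.17, 0.25, 0.43, 0.33))

# Size of the largest group (hypothetical helper variable).
n <- max(table(mydf$id))

# v[seq_len(n)] returns NA for positions beyond length(v), so every group
# yields a vector of length n and aggregate can build matrix columns again.
padded <- do.call(data.frame,
                  aggregate(. ~ id, mydf, function(v) v[seq_len(n)]))
padded
```

The shorter group ends up with NA in its trailing D1 and D2 columns, which is the same trade-off the rectangular options further down make.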
It might be best to just have a list of the resulting vectors:
lapply(split(mydf[-1], mydf[[1]]), function(x) unlist(x, use.names=FALSE))
# $`1`
# [1] 0.40 0.00 0.53 0.21 0.00 0.20
#
# $`2`
# [1] 0.17 0.25 0.55 0.44 0.17 0.25 0.43 0.33
#
Or, if you insist on a rectangular data.frame, explore one of the several tools to rbind unbalanced data, for example, rbind.fill from "plyr":
library(plyr)
rbind.fill(lapply(split(mydf[-1], mydf[[1]]),
                  function(x) data.frame(t(unlist(x, use.names=FALSE)))))
# X1 X2 X3 X4 X5 X6 X7 X8
# 1 0.40 0.00 0.53 0.21 0.00 0.20 NA NA
# 2 0.17 0.25 0.55 0.44 0.17 0.25 0.43 0.33
Alternatively, you can use melt and dcast from "reshape2" as follows:
library(reshape2)
x <- melt(mydf, id.vars = "id")
## ^^ That's not enough information for `dcast`
## We need a "time" variable too, so use `ave`
## to create one according to the number of
## values per ID.
x$time <- ave(x$id, x$id, FUN = seq_along)
## ^^ I would probably actually stop at this point.
## Long data with proper ID and "time" values
## tend to be easier to work with and many
## other functions in R work more nicely with
## this long data format.
dcast(x, id ~ time, value.var = "value")
# id 1 2 3 4 5 6 7 8
# 1 1 0.40 0.00 0.53 0.21 0.00 0.20 NA NA
# 2 2 0.17 0.25 0.55 0.44 0.17 0.25 0.43 0.33
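Note that all of the outputs above list every D1 value for an id before its D2 values, whereas the layout in the question interleaves them row by row (D1, D2 of the first row, then D1, D2 of the second row, and so on). If that ordering matters, a base R sketch along the following lines may help. It is an illustration under assumptions rather than a definitive solution: the names rows, width, and wide are invented here, and padding with NA is assumed to be acceptable for ids with fewer rows.

```r
# Rebuild the unbalanced example data so the block is self-contained.
mydf <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2),
                   D1 = c(0.40, 0.00, 0.53, 0.17, 0.25, 0.55, 0.44),
                   D2 = c(0.21, 0.00, 0.20, 0.17, 0.25, 0.43, 0.33))

# For each id, transpose the block so that flattening it walks row by row:
# D1, D2 of row 1, then D1, D2 of row 2, ...
rows <- lapply(split(mydf[-1], mydf$id),
               function(x) as.vector(t(as.matrix(x))))

# Pad every vector with NA up to the longest one and stack them as rows.
width <- max(lengths(rows))
wide <- t(vapply(rows, function(v) v[seq_len(width)], numeric(width)))
colnames(wide) <- paste0("D", seq_len(width))

# Recover the id from the row names created by split().
wide <- data.frame(id = as.numeric(rownames(wide)), wide, row.names = NULL)
wide
```

With 20000+ rows this should still be quick, since the work is one split followed by simple vector operations per id.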