I'm trying to collapse a data frame by removing all but one row from each group of rows with identical values in a particular column. In other words, the first row from each group.
For example, I'd like to convert this
> d = data.frame(x=c(1,1,2,4),y=c(10,11,12,13),z=c(20,19,18,17)) > d x y z 1 1 10 20 2 1 11 19 3 2 12 18 4 4 13 17
Into this:
x y z 1 1 11 19 2 2 12 18 3 4 13 17
I'm using aggregate to do this currently, but the performance is unacceptable with more data:
> d.ordered = d[order(-d$y),] > aggregate(d.ordered,by=list(key=d.ordered$x),FUN=function(x){x[1]})
I've tried split/unsplit with the same function argument as here, but unsplit complains about duplicate row numbers.
Is rle a possibility? Is there an R idiom to convert rle's length vector into the indices of the rows that start each run, which I can then use to pluck those rows out of the data frame?
collapse is a C/C++ based package for data transformation and statistical computing in R. Its aims are: To facilitate complex data transformation, exploration and computing tasks in R. To help make R code fast, flexible, parsimonious and programmer friendly.
Similarly to readr , dplyr and tidyr are also part of the tidyverse. These packages were loaded in R's memory when we called library(tidyverse) earlier.
Maybe duplicated()
can help:
R> d[ !duplicated(d$x), ] x y z 1 1 10 20 3 2 12 18 4 4 13 17 R>
Edit Shucks, never mind. This picks the first in each block of repetitions, you wanted the last. So here is another attempt using plyr:
R> ddply(d, "x", function(z) tail(z,1)) x y z 1 1 11 19 2 2 12 18 3 4 13 17 R>
Here plyr does the hard work of finding unique subsets, looping over them and applying the supplied function -- which simply returns the last set of observations in a block z
using tail(z, 1)
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With