Collapsing data frame by selecting one row per group

I'm trying to collapse a data frame by removing all but one row from each group of rows with identical values in a particular column. In other words, I want to keep only the first row from each group.

For example, I'd like to convert this

    > d = data.frame(x=c(1,1,2,4),y=c(10,11,12,13),z=c(20,19,18,17))
    > d
      x  y  z
    1 1 10 20
    2 1 11 19
    3 2 12 18
    4 4 13 17

Into this:

      x  y  z
    1 1 11 19
    2 2 12 18
    3 4 13 17

I'm using aggregate to do this currently, but the performance is unacceptable with more data:

    > d.ordered = d[order(-d$y),]
    > aggregate(d.ordered,by=list(key=d.ordered$x),FUN=function(x){x[1]})

I've tried split/unsplit with the same function argument as here, but unsplit complains about duplicate row numbers.

Is rle a possibility? Is there an R idiom to convert rle's length vector into the indices of the rows that start each run, which I can then use to pluck those rows out of the data frame?
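A minimal sketch of the rle idea, assuming the data frame is first sorted so that rows with equal x are adjacent and the largest y comes first within each group. The names `runs`, `starts`, and `result` are illustrative only. The start index of each run is one plus the total length of all runs before it, which `cumsum` gives directly:

```r
# Example data from the question
d <- data.frame(x = c(1, 1, 2, 4), y = c(10, 11, 12, 13), z = c(20, 19, 18, 17))

# Sort so equal x values are adjacent, with the largest y first in each group
d.ordered <- d[order(d$x, -d$y), ]

# rle() reports the length of each run of identical x values
runs <- rle(d.ordered$x)

# The first run starts at row 1; each later run starts where the previous ones end
starts <- cumsum(c(1, head(runs$lengths, -1)))

# Pluck the first row of every run
result <- d.ordered[starts, ]
print(result)
```

Because rle only detects consecutive repeats, the sort on x is what makes this safe; without it, a value that reappears later would be treated as a separate run.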

jkebinger asked Apr 13 '10 02:04


1 Answer

Maybe duplicated() can help:

    R> d[ !duplicated(d$x), ]
      x  y  z
    1 1 10 20
    3 2 12 18
    4 4 13 17
    R>

Edit: Shucks, never mind. This picks the first row in each block of repetitions, but you wanted the last. So here is another attempt using plyr:

    R> library(plyr)
    R> ddply(d, "x", function(z) tail(z,1))
      x  y  z
    1 1 11 19
    2 2 12 18
    3 4 13 17
    R>

Here plyr does the hard work of finding unique subsets, looping over them and applying the supplied function -- which simply returns the last set of observations in a block z using tail(z, 1).
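As a possible alternative that stays in base R, duplicated() can also be asked to scan from the end, so that the last row of each block is the one kept. The sketch below is not part of the original answer; `last_rows` and `max_y_rows` are illustrative names. The second variant assumes the real goal is the row with the largest y per x, and sorts first so the result does not depend on the input order:

```r
# Example data from the question
d <- data.frame(x = c(1, 1, 2, 4), y = c(10, 11, 12, 13), z = c(20, 19, 18, 17))

# fromLast = TRUE flags the earlier occurrences as duplicates,
# so the last row for each x value survives
last_rows <- d[!duplicated(d$x, fromLast = TRUE), ]
print(last_rows)

# Assumption: the wanted row is the one with the largest y within each x.
# Sort by x, then by descending y, and keep the first row per x.
d.ordered <- d[order(d$x, -d$y), ]
max_y_rows <- d.ordered[!duplicated(d.ordered$x), ]
print(max_y_rows)
```

Both variants are vectorised and avoid calling a function once per group, which may help on larger data, though that is worth timing on the actual data set.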

Dirk Eddelbuettel answered Nov 22 '22 05:11