Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select the first row by group

Tags:

dataframe

r

sqldf

From a dataframe like this

test <- data.frame('id'= rep(1:5,2), 'string'= LETTERS[1:10]) test <- test[order(test$id), ] rownames(test) <- 1:10  > test     id string  1   1      A  2   1      F  3   2      B  4   2      G  5   3      C  6   3      H  7   4      D  8   4      I  9   5      E  10  5      J 

I want to create a new one with the first row of each id / string pair. If sqldf accepted R code within it, the query could look like this:

res <- sqldf("select id, min(rownames(test)), string                from test                group by id, string")  > res     id string  1   1      A  3   2      B  5   3      C  7   4      D  9   5      E 

Is there a solution short of creating a new column like

test$row <- rownames(test) 

and running the same sqldf query with min(row)?

like image 979
dmvianna Avatar asked Nov 07 '12 22:11

dmvianna


People also ask

How do I select the first row in a GROUP BY a group?

To do that, you can use the ROW_NUMBER() function. In OVER() , you specify the groups into which the rows should be divided ( PARTITION BY ) and the order in which the numbers should be assigned to the rows ( ORDER BY ). You assign the row numbers within each group (i.e., year).

How do I select only the first row?

To return only the first row that matches your SELECT query, you need to add the LIMIT clause to your SELECT statement. The LIMIT clause is used to control the number of rows returned by your query. When you add LIMIT 1 to the SELECT statement, then only one row will be returned.


1 Answers

You can use duplicated to do this very quickly.

test[!duplicated(test$id),] 

Benchmarks, for the speed freaks:

ju <- function() test[!duplicated(test$id),] gs1 <- function() do.call(rbind, lapply(split(test, test$id), head, 1)) gs2 <- function() do.call(rbind, lapply(split(test, test$id), `[`, 1, )) jply <- function() ddply(test,.(id),function(x) head(x,1)) jdt <- function() {   testd <- as.data.table(test)   setkey(testd,id)   # Initial solution (slow)   # testd[,lapply(.SD,function(x) head(x,1)),by = key(testd)]   # Faster options :   testd[!duplicated(id)]               # (1)   # testd[, .SD[1L], by=key(testd)]    # (2)   # testd[J(unique(id)),mult="first"]  # (3)   # testd[ testd[,.I[1L],by=id] ]      # (4) needs v1.8.3. Allows 2nd, 3rd etc }  library(plyr) library(data.table) library(rbenchmark)  # sample data set.seed(21) test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE)) test <- test[order(test$id), ]  benchmark(ju(), gs1(), gs2(), jply(), jdt(),     replications=5, order="relative")[,1:6] #     test replications elapsed relative user.self sys.self # 1   ju()            5    0.03    1.000      0.03     0.00 # 5  jdt()            5    0.03    1.000      0.03     0.00 # 3  gs2()            5    3.49  116.333      2.87     0.58 # 2  gs1()            5    3.58  119.333      3.00     0.58 # 4 jply()            5    3.69  123.000      3.11     0.51 

Let's try that again, but with just the contenders from the first heat and with more data and more replications.

set.seed(21) test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE)) test <- test[order(test$id), ] benchmark(ju(), jdt(), order="relative")[,1:6] #    test replications elapsed relative user.self sys.self # 1  ju()          100    5.48    1.000      4.44     1.00 # 2 jdt()          100    6.92    1.263      5.70     1.15 
like image 146
Joshua Ulrich Avatar answered Sep 23 '22 05:09

Joshua Ulrich