Select the first row by group

Tags: dataframe, r, sqldf

From a dataframe like this

    test <- data.frame('id'= rep(1:5,2), 'string'= LETTERS[1:10])
    test <- test[order(test$id), ]
    rownames(test) <- 1:10

    > test
       id string
    1   1      A
    2   1      F
    3   2      B
    4   2      G
    5   3      C
    6   3      H
    7   4      D
    8   4      I
    9   5      E
    10  5      J

I want to create a new one with the first row of each id / string pair. If sqldf accepted R code within it, the query could look like this:

    res <- sqldf("select id, min(rownames(test)), string
                  from test
                  group by id, string")

    > res
      id string
    1  1      A
    3  2      B
    5  3      C
    7  4      D
    9  5      E

Is there a solution short of creating a new column like

test$row <- rownames(test)

and running the same sqldf query with min(row)?

979

asked Nov 07 '12 22:11

dmvianna

1 Answer

You can use duplicated to do this very quickly.

test[!duplicated(test$id),]
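To see why this works, here is a minimal sketch on the question's example data. `duplicated()` returns TRUE for every element that has already appeared earlier in the vector, so negating it keeps only the first occurrence of each id. The names `keep` and `res` are illustrative and not part of the original answer.

```r
# Rebuild the question's example data
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

# duplicated() is TRUE for every id that was already seen earlier in the vector
keep <- !duplicated(test$id)

# Subsetting with the negated mask keeps the first row of each id;
# the original row names of the kept rows are preserved
res <- test[keep, ]
print(res)
```

Note that "first" here means first in the current row order, so the data frame should be sorted beforehand if a particular row per id is wanted.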

Benchmarks, for the speed freaks:

    ju <- function() test[!duplicated(test$id),]
    gs1 <- function() do.call(rbind, lapply(split(test, test$id), head, 1))
    gs2 <- function() do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
    jply <- function() ddply(test,.(id),function(x) head(x,1))
    jdt <- function() {
      testd <- as.data.table(test)
      setkey(testd,id)
      # Initial solution (slow)
      # testd[,lapply(.SD,function(x) head(x,1)),by = key(testd)]
      # Faster options :
      testd[!duplicated(id)]               # (1)
      # testd[, .SD[1L], by=key(testd)]    # (2)
      # testd[J(unique(id)),mult="first"]  # (3)
      # testd[ testd[,.I[1L],by=id] ]      # (4) needs v1.8.3. Allows 2nd, 3rd etc
    }

    library(plyr)
    library(data.table)
    library(rbenchmark)

    # sample data
    set.seed(21)
    test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
    test <- test[order(test$id), ]

    benchmark(ju(), gs1(), gs2(), jply(), jdt(),
        replications=5, order="relative")[,1:6]
    #     test replications elapsed relative user.self sys.self
    # 1   ju()            5    0.03    1.000      0.03     0.00
    # 5  jdt()            5    0.03    1.000      0.03     0.00
    # 3  gs2()            5    3.49  116.333      2.87     0.58
    # 2  gs1()            5    3.58  119.333      3.00     0.58
    # 4 jply()            5    3.69  123.000      3.11     0.51

Let's try that again, but with just the contenders from the first heat and with more data and more replications.

    set.seed(21)
    test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
    test <- test[order(test$id), ]
    benchmark(ju(), jdt(), order="relative")[,1:6]
    #    test replications elapsed relative user.self sys.self
    # 1  ju()          100    5.48    1.000      4.44     1.00
    # 2 jdt()          100    6.92    1.263      5.70     1.15
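Before comparing timings, it can be worth confirming that the approaches select the same rows. The following is a rough base R sketch of such a check, assuming a smaller sample than the benchmarks use. The names `small`, `by_dup` and `by_split` are hypothetical, and the split-based version is written with an explicit anonymous function rather than copied from the benchmarked code.

```r
# Small sample in the same shape as the benchmark data (sizes chosen for illustration)
set.seed(21)
small <- data.frame(id = sample(50, 500, TRUE),
                    string = sample(LETTERS, 500, TRUE))
small <- small[order(small$id), ]

# Approach 1: keep the first occurrence of each id
by_dup <- small[!duplicated(small$id), ]

# Approach 2: split by id, take the first row of each piece, bind the pieces back
by_split <- do.call(rbind,
                    lapply(split(small, small$id),
                           function(d) d[1, , drop = FALSE]))

# Row names differ between the two results, so compare the columns only
same_rows <- identical(by_dup$id, by_split$id) &&
  identical(as.character(by_dup$string), as.character(by_split$string))
print(same_rows)
```

The comparison ignores row names on purpose: the `duplicated` version keeps the original row names, while the split-and-bind version derives new ones from the groups.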

146

answered Sep 23 '22 05:09

Joshua Ulrich
