From a dataframe like this
test <- data.frame('id'= rep(1:5,2), 'string'= LETTERS[1:10]) test <- test[order(test$id), ] rownames(test) <- 1:10 > test id string 1 1 A 2 1 F 3 2 B 4 2 G 5 3 C 6 3 H 7 4 D 8 4 I 9 5 E 10 5 J
I want to create a new one with the first row of each id / string pair. If sqldf accepted R code within it, the query could look like this:
res <- sqldf("select id, min(rownames(test)), string from test group by id, string") > res id string 1 1 A 3 2 B 5 3 C 7 4 D 9 5 E
Is there a solution short of creating a new column like
test$row <- rownames(test)
and running the same sqldf query with min(row)?
To do that, you can use the ROW_NUMBER() function. In OVER() , you specify the groups into which the rows should be divided ( PARTITION BY ) and the order in which the numbers should be assigned to the rows ( ORDER BY ). You assign the row numbers within each group (i.e., year).
To return only the first row that matches your SELECT query, you need to add the LIMIT clause to your SELECT statement. The LIMIT clause is used to control the number of rows returned by your query. When you add LIMIT 1 to the SELECT statement, then only one row will be returned.
You can use duplicated
to do this very quickly.
test[!duplicated(test$id),]
Benchmarks, for the speed freaks:
ju <- function() test[!duplicated(test$id),] gs1 <- function() do.call(rbind, lapply(split(test, test$id), head, 1)) gs2 <- function() do.call(rbind, lapply(split(test, test$id), `[`, 1, )) jply <- function() ddply(test,.(id),function(x) head(x,1)) jdt <- function() { testd <- as.data.table(test) setkey(testd,id) # Initial solution (slow) # testd[,lapply(.SD,function(x) head(x,1)),by = key(testd)] # Faster options : testd[!duplicated(id)] # (1) # testd[, .SD[1L], by=key(testd)] # (2) # testd[J(unique(id)),mult="first"] # (3) # testd[ testd[,.I[1L],by=id] ] # (4) needs v1.8.3. Allows 2nd, 3rd etc } library(plyr) library(data.table) library(rbenchmark) # sample data set.seed(21) test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE)) test <- test[order(test$id), ] benchmark(ju(), gs1(), gs2(), jply(), jdt(), replications=5, order="relative")[,1:6] # test replications elapsed relative user.self sys.self # 1 ju() 5 0.03 1.000 0.03 0.00 # 5 jdt() 5 0.03 1.000 0.03 0.00 # 3 gs2() 5 3.49 116.333 2.87 0.58 # 2 gs1() 5 3.58 119.333 3.00 0.58 # 4 jply() 5 3.69 123.000 3.11 0.51
Let's try that again, but with just the contenders from the first heat and with more data and more replications.
set.seed(21) test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE)) test <- test[order(test$id), ] benchmark(ju(), jdt(), order="relative")[,1:6] # test replications elapsed relative user.self sys.self # 1 ju() 100 5.48 1.000 4.44 1.00 # 2 jdt() 100 6.92 1.263 5.70 1.15
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With