Given a sortkey, is there a data.table shortcut to duplicate the first
and last
functionalities found in SAS and SPSS ?
The pedestrian approach below flags the first record of a group.
Given the elegance of data.table (with which I'm slowly getting familiar), I'm assuming there's a shortcut using a self join & mult
, but I'm still trying to figure it out.
Here's the example:
require(data.table) set.seed(123) n <- 17 DT <- data.table(x=sample(letters[1:3],n,replace=T), y=sample(LETTERS[1:3],n,replace=T)) sortkey <- c("x","y") setkeyv(DT,sortkey) key <- paste(DT$x,DT$y,sep="-") nw <- c( T , key[2:n]!=key[1:(n-1)] ) DT$first <- 1*nw DT
The group by will always return the first record in the group on the result set. SELECT id, category_id, post_title FROM posts WHERE id IN ( SELECT MAX(id) FROM posts GROUP BY category_id ); This will return the posts with the highest IDs in each group.
Retrieving the last record in each group using GROUP BY There are two solutions explained here using the GROUP BY clause. In both these solutions, we will be using the MAX() function to get the maximum value of id and then retrieving the other columns corresponding to this maximum id.
One easy way is to use the duplicated()
function. When applied to a data-frame, it produces a vector where an entry is TRUE if and only if the row value combination has not occurred before, when moving down the data-frame.
DT$first <- !duplicated( DT[, list(x,y) ]) DT$last <- rev(!duplicated( DT[, list(rev(x),rev(y)) ])) > DT x y first last [1,] a A TRUE TRUE [2,] a B TRUE TRUE [3,] a C TRUE FALSE [4,] a C FALSE TRUE [5,] b A TRUE TRUE [6,] b B TRUE TRUE [7,] b C TRUE FALSE [8,] b C FALSE FALSE [9,] b C FALSE FALSE [10,] b C FALSE TRUE [11,] c A TRUE FALSE [12,] c A FALSE TRUE [13,] c B TRUE FALSE [14,] c B FALSE TRUE [15,] c C TRUE FALSE [16,] c C FALSE FALSE [17,] c C FALSE TRUE
Another way without using duplicated()
is:
DT[ unique(DT), list(first = c(1, rep(0,length(y)-1)), last = c(rep(0,length(y)-1),1 )) ] x y first last [1,] a A 1 1 [2,] a B 1 1 [3,] a C 1 0 [4,] a C 0 1 [5,] b A 1 1 [6,] b B 1 1 [7,] b C 1 0 [8,] b C 0 0 [9,] b C 0 0 [10,] b C 0 1 [11,] c A 1 0 [12,] c A 0 1 [13,] c B 1 0 [14,] c B 0 1 [15,] c C 1 0 [16,] c C 0 0 [17,] c C 0 1
Here are couple of solutions using data.table
:
## Option 1 (cleaner solution, added 2016-11-29) uDT <- unique(DT) DT[, c("first","last"):=0L] DT[uDT, first:=1L, mult="first"] DT[uDT, last:=1L, mult="last"] ## Option 2 (original answer, retained for posterity) DT <- cbind(DT, first=0L, last=0L) DT[DT[unique(DT),,mult="first", which=TRUE], first:=1L] DT[DT[unique(DT),,mult="last", which=TRUE], last:=1L] head(DT) # x y first last # [1,] a A 1 1 # [2,] a B 1 1 # [3,] a C 1 0 # [4,] a C 0 1 # [5,] b A 1 1 # [6,] b B 1 1
There's obviously a lot packed into each of those lines. The key construct, though, is the following, which returns the row index of the first record in each group:
DT[unique(DT),,mult="first", which=TRUE] # [1] 1 2 3 5 6 7 11 13 15
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With