Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

using data.table to flag the first (or last) record in a group

Tags:

Given a sortkey, is there a data.table shortcut to duplicate the first and last functionalities found in SAS and SPSS ?

The pedestrian approach below flags the first record of a group.

Given the elegance of data.table (with which I'm slowly getting familiar), I'm assuming there's a shortcut using a self join & mult, but I'm still trying to figure it out.

Here's the example:

require(data.table)  set.seed(123) n <- 17 DT <- data.table(x=sample(letters[1:3],n,replace=T),                  y=sample(LETTERS[1:3],n,replace=T)) sortkey  <- c("x","y") setkeyv(DT,sortkey) key <- paste(DT$x,DT$y,sep="-") nw <- c( T , key[2:n]!=key[1:(n-1)] ) DT$first <- 1*nw DT 
like image 372
M.Dimo Avatar asked May 06 '12 21:05

M.Dimo


People also ask

How do I get last record by group by?

The group by will always return the first record in the group on the result set. SELECT id, category_id, post_title FROM posts WHERE id IN ( SELECT MAX(id) FROM posts GROUP BY category_id ); This will return the posts with the highest IDs in each group.

How do I get the latest record of each ID in SQL?

Retrieving the last record in each group using GROUP BY There are two solutions explained here using the GROUP BY clause. In both these solutions, we will be using the MAX() function to get the maximum value of id and then retrieving the other columns corresponding to this maximum id.


2 Answers

One easy way is to use the duplicated() function. When applied to a data-frame, it produces a vector where an entry is TRUE if and only if the row value combination has not occurred before, when moving down the data-frame.

DT$first <- !duplicated( DT[, list(x,y) ])                                                                                                                                                                                                                                     DT$last <- rev(!duplicated( DT[, list(rev(x),rev(y)) ]))                                                                                                                                                                                                                        > DT                                                                                                                                                                                                                                                                                 x y first  last                                                                                                                                                                                                                                                           [1,] a A  TRUE  TRUE                                                                                                                                                                                                                                                           [2,] a B  TRUE  TRUE                                                                                                                                                                                                                                                           [3,] a C  TRUE FALSE                                                                                                                                                                                                                                                           [4,] a C FALSE  TRUE                                                                                                                                                                                                                                                           [5,] b A  TRUE  TRUE                                                                                                                                                                                                                                                           [6,] b B  TRUE  TRUE                                                                                                                                                                                                                                                           [7,] b C  TRUE FALSE                                                                                                                                                                                                                                                           [8,] b C FALSE FALSE                                                                                                                                                                                                                                                           [9,] b C FALSE FALSE                                                                                                                                                                                                                                                          [10,] b C FALSE  TRUE                                                                                                                                                                                                                                                          [11,] c A  TRUE FALSE                                                                                                                                                                                                                                                          [12,] c A FALSE  TRUE                                                                                                                                                                                                                                                          [13,] c B  TRUE FALSE                                                                                                                                                                                                                                                          [14,] c B FALSE  TRUE                                                                                                                                                                                                                                                          [15,] c C  TRUE FALSE                                                                                                                                                                                                                                                          [16,] c C FALSE FALSE                                                                                                                                                                                                                                                          [17,] c C FALSE  TRUE             

Another way without using duplicated() is:

DT[ unique(DT), list(first = c(1, rep(0,length(y)-1)),                                                                                                                                                                                                                                              last =  c(rep(0,length(y)-1),1 )) ]              x y  first last                                                                                                                                                                                                                                                      [1,] a A     1    1                                                                                                                                                                                                                                                            [2,] a B     1    1                                                                                                                                                                                                                                                            [3,] a C     1    0                                                                                                                                                                                                                                                            [4,] a C     0    1                                                                                                                                                                                                                                                            [5,] b A     1    1                                                                                                                                                                                                                                                            [6,] b B     1    1                                                                                                                                                                                                                                                            [7,] b C     1    0                                                                                                                                                                                                                                                            [8,] b C     0    0                                                                                                                                                                                                                                                            [9,] b C     0    0                                                                                                                                                                                                                                                           [10,] b C     0    1                                                                                                                                                                                                                                                           [11,] c A     1    0                                                                                                                                                                                                                                                           [12,] c A     0    1                                                                                                                                                                                                                                                           [13,] c B     1    0                                                                                                                                                                                                                                                           [14,] c B     0    1                                                                                                                                                                                                                                                           [15,] c C     1    0                                                                                                                                                                                                                                                           [16,] c C     0    0                                                                                                                                                                                                                                                           [17,] c C     0    1           
like image 28
Prasad Chalasani Avatar answered Oct 23 '22 04:10

Prasad Chalasani


Here are couple of solutions using data.table:

## Option 1 (cleaner solution, added 2016-11-29) uDT <- unique(DT) DT[, c("first","last"):=0L] DT[uDT, first:=1L, mult="first"] DT[uDT, last:=1L, mult="last"]   ## Option 2 (original answer, retained for posterity) DT <- cbind(DT, first=0L, last=0L) DT[DT[unique(DT),,mult="first", which=TRUE], first:=1L] DT[DT[unique(DT),,mult="last", which=TRUE], last:=1L]  head(DT) #      x y first last # [1,] a A     1    1 # [2,] a B     1    1 # [3,] a C     1    0 # [4,] a C     0    1 # [5,] b A     1    1 # [6,] b B     1    1 

There's obviously a lot packed into each of those lines. The key construct, though, is the following, which returns the row index of the first record in each group:

DT[unique(DT),,mult="first", which=TRUE] # [1]  1  2  3  5  6  7 11 13 15 
like image 64
Josh O'Brien Avatar answered Oct 23 '22 04:10

Josh O'Brien