Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extract most recent entry, under a certain condition

I'm struggling to get the following done:

Example dataset:

   belongID   uniqID   Time   Rating  
   1           101       5      0  
   1           102       4      0  
   2           103       4      0  
   2           104       3      0  
   2           105       2      5
   3           106       4      2  
   3           107       5      0  
   3           108       5      1 

The problem is: I would like to extract the most recent entry (largest value for time) per belongID, unless this rating is 0. If the rating of the most recent entry is 0 however. I want the first entry with a rating (not the highest rating, just the first value with a rating that is not zero). If all other entries are also zero, the most recent one needs to be selected.

The end result should than be:

   belongID   uniqID   Time   Rating  
   1           101       5      0  
   2           105       2      5
   3           108       5      1  

The dataset is pretty large and is ordered by belongID. It is not ordered by time, so more recent entries may come after older entries with the same belongID.

Without having the "0 Rating" constraint, I used the following function to calculate the most recent entry:

>uniqueMax <- function(m, belongID = 1, time = 3) {
    t(
      vapply(
         split(1:nrow(m), m[,belongID]), 
         function(i, x, time) x[i, , drop=FALSE][which.max(x[i,time]),], m[1,], x=m, time=time
      )
    )
 }

I do not know how to incorporate the "0 Rating" constraint.

EDIT: A follow up question:

Does anyone know how the getRating function should be altered if not only rating zero, but more ratings need to be taken into account (for instance 0,1,4 and 5)? Thus assign to most recent, unless Rating 0 or 1 or 4 or 5? If Rating is 0,1,4,5 assign to most recent entry with a different rating. If all ratings are 0,1,4 or 5 assign to the most recent of those. I tried the following, but that did not work:

getRating <- function(x){
  iszero <- x$Rating == 0 | x$Rating == 1 | x$Rating == 4 | x$Rating ==5
  if(all(iszero)){
    id <- which.max(x$Time)
  } else {
    id <- which.max((!iszero)*x$Time) 
            # This trick guarantees taking 0 into account
  }
  x[id,]
}
# Do this over the complete data frame
do.call(rbind,lapply(split(Data,Data$belongID),getRating)) 
     # edited per Tyler's suggestion'
like image 707
Max van der Heijden Avatar asked May 10 '12 12:05

Max van der Heijden


People also ask

How do I get the most recent entry in SQL?

Use the aggregate MAX(signin) grouped by id. This will list the most recent signin for each id . To get the whole single record, perform an INNER JOIN against a subquery which returns only the MAX(signin) per id.

Which query can be used to find the last record entry?

To get the last record, the following is the query. mysql> select *from getLastRecord ORDER BY id DESC LIMIT 1; The following is the output. The above output shows that we have fetched the last record, with Id 4 and Name Carol.

How do I get last 5 entries in SQL?

METHOD 1 : Using LIMIT clause in descending orderof specified rows from specifies row. We will retrieve last 5 rows in descending order using LIMIT and ORDER BY clauses and finally make the resultant rows ascending. Since Employee table has IDs, we will perform ORDER BY ID in our query.

How do I select a specific entry in SQL?

To select rows using selection symbols for character or graphic data, use the LIKE keyword in a WHERE clause, and the underscore and percent sign as selection symbols. You can create multiple row conditions, and use the AND, OR, or IN keywords to connect the conditions.


2 Answers

Here's my crack at it (interesting problem):

Reading in your data:

m <- read.table(text="belongID   uniqID   Time   Rating  
   1           101       5      0  
   1           102       4      0  
   2           103       4      0  
   2           104       3      0  
   2           105       2      5
   3           106       4      2  
   3           107       5      0  
   3           108       5      1 ", header=T)

Extracting the rows you asked for:

m2 <- m[order(m$belongID, -m$Time), ]                 #Order to get max time first
LIST <- split(m2, m$belongID)                         #split by belongID
FUN <- function(x) which(cumsum(x[, 'Rating'])!=0)[1] #find first non zero Rating
LIST2 <- lapply(LIST, function(x){                    #apply FUN; if NA do 1st row
        if (is.na(FUN(x))) {
            x[1, ]
        } else {
            x[FUN(x), ]
        }
    }
)
do.call('rbind', LIST2)                              #put it all back together

Which yields:

  belongID uniqID Time Rating
1        1    101    5      0
2        2    105    2      5
3        3    108    5      1

EDIT With so many people answering this problem (fun to solve IMHO) it begged for a microbenchmark test (Windows 7):

Unit: milliseconds
    expr       min        lq    median        uq      max
1   JIGR  6.356293  6.656752  7.024161  8.697213 179.0884
2 JORRIS  2.932741  3.031416  3.153420  3.552554 246.9604
3  PETER 10.851046 11.459896 12.358939 17.164881 216.7284
4  TYLER  2.864625  2.961667  3.066174  3.413289 221.1569

And a graph:

enter image description here

like image 156
Tyler Rinker Avatar answered Sep 27 '22 22:09

Tyler Rinker


Here's a solution that uses data.table for ease of filtering and performing my function getRecentRow separately for each belongID.

library(data.table)

# Load the data from the example.
dat = structure(list(belongID = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), 
          uniqID = 101:108, Time = c(5L, 4L, 4L, 3L, 2L, 4L, 5L, 5L),
          Rating = c(0L, 0L, 0L, 0L, 5L, 2L, 0L, 1L)), 
          .Names = c("belongID", "uniqID", "Time", "Rating"),
          row.names = c(NA, -8L), class = c("data.table", "data.frame"))

dat = data.table(dat) # Convert to data table.

# Function to get the row for a given belongID
getRecentRow <- function(data) {
    # Filter by Rating, then order by time, then select first.
    row = data[Rating != 0][order(-Time)][1]

    if(!is.na(row$uniqID)) {
        # A row was found with Rating != 0, return it.
        return(row)
     } else {
          # The row was blank, so filter again without restricting. rating.
          return(data[order(-Time)][1])
        }  
}

# Run getRecentRow on each chunk of dat with a given belongID
result = dat[,getRecentRow(.SD), by=belongID]

     belongID uniqID Time Rating
[1,]        1    101    5      0
[2,]        2    105    2      5
[3,]        3    108    5      1
like image 27
Peter Fine Avatar answered Sep 27 '22 23:09

Peter Fine