Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select values row-wise based on rank among dates

Let's say I have a data frame with several rows like the following:

df <- data.frame(a = c(NA,20,NA),
                 date1 = c("2016-03-01", "2016-02-01", "2016-02-01"),
                 b = c(50,NA, NA),
                 date2 = c("2016-02-01", "2016-03-01", "2016-03-01"),
                 c = c(10,10, 10),
                 date3 = c("2016-01-01","2016-01-01", "2016-01-01"))

For each row, I want to get the latest value which is not a NA between a, b, and c according to the dates (so I respectively look at date1, date2, or date3 and pick the most recent one).

Basically, date1 gives the date corresponding to the value a, date2 gives the date corresponding to the value b, date3 gives the date corresponding to the value c.

If date1 > date2 & date1 > date3, I will want to take the value a However, if the value a is NA (which is the case in my example), I will compare date2 and date3. In my example, date2 > date3 , and since value b is not NA but 50, I will take 50 as my final result.

Now I want to do this for all the rows in my dataframe

Since I am using dplyr, I tried to use the case_when function by using the rank function (in my example, I look a the first ranked date, and then look at the linked value. If it is a NA, I look at the 2nd best ranked, etc...)

However, I can't just put, as I'd like to do, :

df <- df %>%
        mutate(result = case_when(is.na(a) & is.na(b) & is.na(c) ~ NA_integer_,
                                  rev(rank(date1, date2, date3))[1] == 3 & !is.na(a) ~ a,
                                  rev(rank(date1, date2, date3))[2] == 3 & !is.na(b) ~ b,
                                  rev(rank(date1, date2, date3))[3] == 3 & !is.na(a) ~ c,
                                  rev(rank(date1, date2, date3))[1] == 2 & !is.na(a) ~ a,
                                  rev(rank(date1, date2, date3))[2] == 2 & !is.na(b) ~ b,
                                  rev(rank(date1, date2, date3))[3] == 2 & !is.na(a) ~ c,
                                  rev(rank(date1, date2, date3))[1] == 1 & !is.na(a) ~ a,
                                  rev(rank(date1, date2, date3))[2] == 1 & !is.na(b) ~ b,
                                  rev(rank(date1, date2, date3))[3] == 1 & !is.na(a) ~ c))

Because the rank function needs a unique vector as argument (but I can't put c(date1, date2, date3) neither because it would give me the whole order of this vector and not the rank for each row)

In my example the result I would like to have would be

res

a    date1         b      date2       c    date3       result
NA   2016-03-01    50     2016-02-01  10   2016-01-01  50
20   2016-02-01    NA     2016-03-01  10   2016-01-01  20
NA   2016-02-01    NA     2016-03-01  10   2016-01-01  10

Does anyone have an idea or even an entirely different approach to this problem ?

like image 462
MBB Avatar asked Aug 28 '17 12:08

MBB


People also ask

How to find the rank of each row in result set?

Summary: in this tutorial, you will learn how to use SQL RANK () function to find the rank of each row in the result set. The RANK () function is a window function that assigns a rank to each row in the partition of a result set.

How do I select rows from a list of values?

The following code shows how to select rows where the value in a certain column belongs to a list of values: #select rows where team is equal to 'A' or 'C' df [df$team %in% c ('A', 'C'), ]

What happens if more than one row has the same rank?

Even though more than one row can have the same rank, the next row will get the next rank. For example, if two rows are 5th, the next row will be 6th. The ROW_NUMBER () function is different.

Can you predict which row will get which rank in SQL?

It depends on many factors, and you can’t predict which row will get which rank. These are the basic and (probably) most commonly used SQL ranking functions. If you want to learn about other ranking functions, you can read this overview of ranking functions.


5 Answers

I suggest converting to long-format and computing the relevant values. If you want, you can then add the results to your original data.frame. Here's how you could do that using data.table:

library(data.table)
setDT(df)                     # convert to data.table object
df[, row := .I]               # add a row-id
dflong <- melt(df, id = "row", measure = patterns("^date", "^(a|b|c)"),
               na.rm = TRUE) # convert to long format
setorder(dflong, value1)      # reorder by date value
dflong <- unique(dflong, by = "row", fromLast = TRUE) # get the latest dates
df[dflong, result := i.value2, on = "row"]  # add result to original data

df
#    a      date1  b      date2  c      date3 row result
#1: NA 2016-03-01 50 2016-02-01 10 2016-01-01   1     50
#2: 20 2016-02-01 NA 2016-03-01 10 2016-01-01   2     20
#3: NA 2016-02-01 NA 2016-03-01 10 2016-01-01   3     10
like image 185
talat Avatar answered Oct 16 '22 17:10

talat


This ought to handle it. First we put the data in tidy form (1 row for each Date, Value, along with a row_num to identify which example the tidy row belongs to). Then we filter out NAs, group_by row_num, order by Date descending, and take the first row.

df %>%
  mutate(row_num = row_number()) %>%
  unite(a, a, date1) %>%
  unite(b, b, date2) %>%
  unite(c, c, date3) %>%
  gather(key, value, -row_num) %>%
  select(-key) %>%
  separate(value, into=c("Value", "Date"), sep = "_") %>%
  mutate(Date = as.Date(Date)) %>%
  filter(Value != "NA") %>%
  group_by(row_num) %>%
  top_n(1, Date) %>%
  ungroup()

Results

like image 25
Jason Punyon Avatar answered Oct 16 '22 17:10

Jason Punyon


Here is one way to do it...

df$result <- apply(df, 1, function(x){
  dates <- as.Date(x[seq(2, length(x), 2)])
  values <- x[seq(1,length(x),2)]
  return(values[!is.na(values)][which.max(dates[!is.na(values)])])
})

df
   a      date1  b      date2  c      date3 result
1 NA 2016-03-01 50 2016-02-01 10 2016-01-01     50
2 20 2016-02-01 NA 2016-03-01 10 2016-01-01     20
3 NA 2016-02-01 NA 2016-03-01 10 2016-01-01     10
like image 2
Andrew Gustar Avatar answered Oct 16 '22 16:10

Andrew Gustar


Here one more approach:

df$row <- 1:nrow(df)

gather(df, key, date_val, date1, date2, date3, -row) %>% 
   select(-key) %>% 
   gather(key, val, a,b,c) %>% 
   filter(!is.na(val)) %>% 
   group_by(row) %>% 
   mutate(max_date = max(date_val)) %>% 
   filter(date_val == max_date) %>% summarise(result = max(val)) %>% 
   left_join(df, by="row") %>% select(-row)

# A tibble: 3 × 7
  result     a      date1     b      date2     c      date3
   <dbl> <dbl>     <fctr> <dbl>     <fctr> <dbl>     <fctr>
1     50    NA 2016-03-01    50 2016-02-01    10 2016-01-01
2     20    20 2016-02-01    NA 2016-03-01    10 2016-01-01
3     10    NA 2016-02-01    NA 2016-03-01    10 2016-01-01
like image 1
coffeinjunky Avatar answered Oct 16 '22 15:10

coffeinjunky


Another base alternative:

df$id <- 1:nrow(df)
d2 <- reshape(df, varying = list(seq(1, by = 2, len = (ncol(df) - 1)/2),
                                 seq(2, by = 2, len = (ncol(df) - 1)/2)),
              direction = "long")

d2 <- with(d2, d2[order(-id, date1, decreasing = TRUE), ])

cbind(df, res = tapply(d2$a[!is.na(d2$a)], d2$id[!is.na(d2$a)], `[`, 1)) 
#    a      date1  b      date2  c      date3 id res
# 1 NA 2016-03-01 50 2016-02-01 10 2016-01-01  1  50
# 2 20 2016-02-01 NA 2016-03-01 10 2016-01-01  2  20
# 3 NA 2016-02-01 NA 2016-03-01 10 2016-01-01  3  10
like image 1
Henrik Avatar answered Oct 16 '22 16:10

Henrik