Let's say I have a data frame with several rows like the following:
df <- data.frame(a = c(NA, 20, NA),
                 date1 = c("2016-03-01", "2016-02-01", "2016-02-01"),
                 b = c(50, NA, NA),
                 date2 = c("2016-02-01", "2016-03-01", "2016-03-01"),
                 c = c(10, 10, 10),
                 date3 = c("2016-01-01", "2016-01-01", "2016-01-01"))
For each row, I want to get the latest value that is not NA among a, b, and c, according to the dates (so I look at date1, date2, or date3 respectively and pick the most recent one). Basically, date1 gives the date corresponding to the value a, date2 gives the date corresponding to the value b, and date3 gives the date corresponding to the value c.
If date1 > date2 & date1 > date3, I want to take the value a. However, if the value a is NA (which is the case in the first row of my example), I compare date2 and date3. In my example, date2 > date3, and since the value b is not NA but 50, I take 50 as my final result.
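To make the rule concrete, here is a rough base R sketch of what I mean for the first row only (just an illustration of the rule, not a solution for the whole data frame):
vals  <- c(a = NA, b = 50, c = 10)                             # values of the first row
dates <- as.Date(c("2016-03-01", "2016-02-01", "2016-01-01"))  # their dates
ord <- order(dates, decreasing = TRUE)                         # most recent date first
vals[ord][which(!is.na(vals[ord]))[1]]                         # first non-NA in that order: 50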
Now I want to do this for all the rows in my data frame. Since I am using dplyr, I tried to use the case_when function together with the rank function (in my example, I look at the best-ranked date and then at the linked value; if it is NA, I look at the 2nd best-ranked date, and so on). However, I can't just write, as I'd like to:
df <- df %>%
  mutate(result = case_when(is.na(a) & is.na(b) & is.na(c) ~ NA_integer_,
                            rev(rank(date1, date2, date3))[1] == 3 & !is.na(a) ~ a,
                            rev(rank(date1, date2, date3))[2] == 3 & !is.na(b) ~ b,
                            rev(rank(date1, date2, date3))[3] == 3 & !is.na(a) ~ c,
                            rev(rank(date1, date2, date3))[1] == 2 & !is.na(a) ~ a,
                            rev(rank(date1, date2, date3))[2] == 2 & !is.na(b) ~ b,
                            rev(rank(date1, date2, date3))[3] == 2 & !is.na(a) ~ c,
                            rev(rank(date1, date2, date3))[1] == 1 & !is.na(a) ~ a,
                            rev(rank(date1, date2, date3))[2] == 1 & !is.na(b) ~ b,
                            rev(rank(date1, date2, date3))[3] == 1 & !is.na(a) ~ c))
This fails because the rank function needs a single vector as its argument (and I can't pass c(date1, date2, date3) either, because that would rank the whole combined vector rather than give me a rank within each row).
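To illustrate the problem, the closest I can get with rank is something like the line below, which ranks all nine dates of the three columns together instead of ranking the three dates within each row:
rank(c(df$date1, df$date2, df$date3))   # one global ranking of all 9 dates, not 3 per row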
In my example, the result (res) I would like to have would be:
a date1 b date2 c date3 result
NA 2016-03-01 50 2016-02-01 10 2016-01-01 50
20 2016-02-01 NA 2016-03-01 10 2016-01-01 20
NA 2016-02-01 NA 2016-03-01 10 2016-01-01 10
Does anyone have an idea, or even an entirely different approach to this problem?
I suggest converting to long-format and computing the relevant values. If you want, you can then add the results to your original data.frame. Here's how you could do that using data.table:
library(data.table)
setDT(df) # convert to data.table object
df[, row := .I] # add a row-id
dflong <- melt(df, id = "row", measure = patterns("^date", "^(a|b|c)"),
               na.rm = TRUE)                           # convert to long format
setorder(dflong, value1) # reorder by date value
dflong <- unique(dflong, by = "row", fromLast = TRUE) # get the latest dates
df[dflong, result := i.value2, on = "row"] # add result to original data
df
# a date1 b date2 c date3 row result
#1: NA 2016-03-01 50 2016-02-01 10 2016-01-01 1 50
#2: 20 2016-02-01 NA 2016-03-01 10 2016-01-01 2 20
#3: NA 2016-02-01 NA 2016-03-01 10 2016-01-01 3 10
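If you don't want to keep the helper row column in the final output, you can drop it again afterwards:
df[, row := NULL]   # remove the helper column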
This ought to handle it. First we put the data in tidy form (one row per Value/Date pair, along with a row_num to identify which original row the tidy row belongs to). Then we filter out the NAs, group by row_num, and keep the row with the most recent Date.
library(dplyr)
library(tidyr)

df %>%
  mutate(row_num = row_number()) %>%
  unite(a, a, date1) %>%
  unite(b, b, date2) %>%
  unite(c, c, date3) %>%
  gather(key, value, -row_num) %>%
  select(-key) %>%
  separate(value, into = c("Value", "Date"), sep = "_") %>%
  mutate(Date = as.Date(Date)) %>%
  filter(Value != "NA") %>%
  group_by(row_num) %>%
  top_n(1, Date) %>%
  ungroup()
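Note that Value comes out of separate() as character ("50", "20", "10"). Assuming the pipeline above is assigned to an object called picked (a name used here only for illustration), you could convert it back to numeric and attach it to the original data along these lines:
df %>%
  mutate(row_num = row_number()) %>%
  left_join(picked %>% transmute(row_num, result = as.numeric(Value)),
            by = "row_num") %>%
  select(-row_num)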
Here is one way to do it...
df$result <- apply(df, 1, function(x){
  dates  <- as.Date(x[seq(2, length(x), 2)])   # the even-numbered columns: the dates
  values <- x[seq(1, length(x), 2)]            # the odd-numbered columns: the values
  return(values[!is.na(values)][which.max(dates[!is.na(values)])])
})
df
a date1 b date2 c date3 result
1 NA 2016-03-01 50 2016-02-01 10 2016-01-01 50
2 20 2016-02-01 NA 2016-03-01 10 2016-01-01 20
3 NA 2016-02-01 NA 2016-03-01 10 2016-01-01 10
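A small caveat: apply() converts each row to character, so result comes back as a character column. If you need it numeric, you can convert it afterwards:
df$result <- as.numeric(df$result)   # apply() returned the values as character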
Here's one more approach:
df$row <- 1:nrow(df)
gather(df, key, date_val, date1, date2, date3, -row) %>%
  select(-key) %>%
  gather(key, val, a, b, c) %>%
  filter(!is.na(val)) %>%
  group_by(row) %>%
  mutate(max_date = max(date_val)) %>%
  filter(date_val == max_date) %>%
  summarise(result = max(val)) %>%
  left_join(df, by = "row") %>%
  select(-row)
# A tibble: 3 × 7
result a date1 b date2 c date3
<dbl> <dbl> <fctr> <dbl> <fctr> <dbl> <fctr>
1 50 NA 2016-03-01 50 2016-02-01 10 2016-01-01
2 20 20 2016-02-01 NA 2016-03-01 10 2016-01-01
3 10 NA 2016-02-01 NA 2016-03-01 10 2016-01-01
Another base R alternative:
df$id <- 1:nrow(df)

# stack the value columns (1, 3, 5) and the date columns (2, 4, 6) into long format
d2 <- reshape(df, varying = list(seq(1, by = 2, len = (ncol(df) - 1)/2),
                                 seq(2, by = 2, len = (ncol(df) - 1)/2)),
              direction = "long")

# within each id, put the most recent date first
d2 <- with(d2, d2[order(-id, date1, decreasing = TRUE), ])

# for each id, take the first non-NA value, i.e. the value with the latest date
cbind(df, res = tapply(d2$a[!is.na(d2$a)], d2$id[!is.na(d2$a)], `[`, 1))
# a date1 b date2 c date3 id res
# 1 NA 2016-03-01 50 2016-02-01 10 2016-01-01 1 50
# 2 20 2016-02-01 NA 2016-03-01 10 2016-01-01 2 20
# 3 NA 2016-02-01 NA 2016-03-01 10 2016-01-01 3 10