Get the latest non-NA value based on date column by group

Question

I have a dataframe having country_name, date and several columns: column_1, column_2and column_3. I am trying to extract the latest record based on date across several columns.

The dataframe looks like this:

| country_name | date        | column_1| column_2| column_3|
| US           | 2016-11-02  | 7.5     | NA      | NA      |
| US           | 2017-09-12  | NA      | NA      | 9       |
| US           | 2017-09-19  | NA      | 8       | 10      |
| US           | 2020-02-10  | 10      | NA      | NA      |
| US           | 2021-03-10  | NA      | NA      | 7.3     |
| US           | 2021-05-02  | NA      | 3       | NA      |
| UK           | 2016-11-02  | NA      | 2       | NA      |
| UK           | 2017-09-12  | 0.5     | 3       | NA      |
 .
 .

For the US the desired output is:

| country_name | column_1| column_2| column_3|
| US           | 10      | 3       | 7.3     |

For column_1, the value with the latest date is 10 (date: 2020-02-10), for column_2 is 3 (date: 2021-05-02), and for column_3 is 7.3 (date: 2021-03-10). My goal is to apply this logic across several countries. How do I achieve this?

M-- · Accepted Answer

library(dplyr)
library(tidyr)

df1 %>% 
  mutate(date = as.Date(date)) %>% 
  group_by(country_name) %>%
  arrange(date) %>%
  select(-date) %>% 
  fill(everything()) %>% 
  slice(n())

#> # A tibble: 2 x 4
#> # Groups:   country_name [2]
#>   country_name column_1 column_2 column_3
#>   <chr>           <dbl>    <int>    <dbl>
#> 1 UK                0.5        3     NA  
#> 2 US               10          3      7.3

Data:

read.table(text = "country_name  date         column_1 column_2 column_3
                   US            2016-11-02   7.5      NA       NA      
                   US            2017-09-12   NA       NA       9       
                   US            2017-09-19   NA       8        10      
                   US            2020-02-10   10       NA       NA      
                   US            2021-03-10   NA       NA       7.3     
                   US            2021-05-02   NA       3        NA      
                   UK            2016-11-02   NA       2        NA      
                   UK            2017-09-12   0.5      3        NA", 
           header = T, stringsAsFactors = F) -> df1

jay.sf · Answer

You could na.omit and reverse each column and take first element. Then rbind. Take care of the right order and if it's as.Date formatted.

by(transform(dat, date=as.Date(date)), dat$country_name, \(x) {
  cbind(x[1, 1, drop=FALSE], 
        lapply(x[order(x$date), 3:5], \(z) {
          z <- el(rev(na.omit(z)))
          ifelse(length(z) == 1, z, NA_real_)
        }))
}) |> c(make.row.names=FALSE) |> do.call(what=rbind)
#   country_name column_1 column_2 column_3
# 1           UK      0.5        3       NA
# 2           US     10.0        3      7.3

Data:

dat <- structure(list(country_name = c("US", "US", "US", "US", "US", 
"US", "UK", "UK"), date = c("2016-11-02", "2017-09-12", "2017-09-19", 
"2020-02-10", "2021-03-10", "2021-05-02", "2016-11-02", "2017-09-12"
), column_1 = c(7.5, NA, NA, 10, NA, NA, NA, 0.5), column_2 = c(NA, 
NA, 8L, NA, NA, 3L, 2L, 3L), column_3 = c(NA, 9, 10, NA, 7.3, 
NA, NA, NA)), class = "data.frame", row.names = c(NA, -8L))

Get the latest non-NA value based on date column by group

Tags:

dataframe

r

grouping

user11226439

2 Answers

Data:

M--

jay.sf

Recent Activity

Donate For Us

Get the latest non-NA value based on date column by group

Tags:

dataframe

r

grouping

user11226439

2 Answers

Data:

M--

jay.sf

Related questions

Recent Activity

Donate For Us