I have a dataframe having country_name, date and several columns: column_1, column_2and column_3. I am trying to extract the latest record based on date across several columns.
The dataframe looks like this:
| country_name | date | column_1| column_2| column_3|
| US | 2016-11-02 | 7.5 | NA | NA |
| US | 2017-09-12 | NA | NA | 9 |
| US | 2017-09-19 | NA | 8 | 10 |
| US | 2020-02-10 | 10 | NA | NA |
| US | 2021-03-10 | NA | NA | 7.3 |
| US | 2021-05-02 | NA | 3 | NA |
| UK | 2016-11-02 | NA | 2 | NA |
| UK | 2017-09-12 | 0.5 | 3 | NA |
.
.
For the US the desired output is:
| country_name | column_1| column_2| column_3|
| US | 10 | 3 | 7.3 |
For column_1, the value with the latest date is 10 (date: 2020-02-10),
for column_2 is 3 (date: 2021-05-02), and for column_3 is 7.3 (date: 2021-03-10). My goal is to apply this logic across several countries. How do I achieve this?
library(dplyr)
library(tidyr)
df1 %>%
mutate(date = as.Date(date)) %>%
group_by(country_name) %>%
arrange(date) %>%
select(-date) %>%
fill(everything()) %>%
slice(n())
#> # A tibble: 2 x 4
#> # Groups: country_name [2]
#> country_name column_1 column_2 column_3
#> <chr> <dbl> <int> <dbl>
#> 1 UK 0.5 3 NA
#> 2 US 10 3 7.3
read.table(text = "country_name date column_1 column_2 column_3
US 2016-11-02 7.5 NA NA
US 2017-09-12 NA NA 9
US 2017-09-19 NA 8 10
US 2020-02-10 10 NA NA
US 2021-03-10 NA NA 7.3
US 2021-05-02 NA 3 NA
UK 2016-11-02 NA 2 NA
UK 2017-09-12 0.5 3 NA",
header = T, stringsAsFactors = F) -> df1
You could na.omit and reverse each column and take first element. Then rbind. Take care of the right order and if it's as.Date formatted.
by(transform(dat, date=as.Date(date)), dat$country_name, \(x) {
cbind(x[1, 1, drop=FALSE],
lapply(x[order(x$date), 3:5], \(z) {
z <- el(rev(na.omit(z)))
ifelse(length(z) == 1, z, NA_real_)
}))
}) |> c(make.row.names=FALSE) |> do.call(what=rbind)
# country_name column_1 column_2 column_3
# 1 UK 0.5 3 NA
# 2 US 10.0 3 7.3
Data:
dat <- structure(list(country_name = c("US", "US", "US", "US", "US",
"US", "UK", "UK"), date = c("2016-11-02", "2017-09-12", "2017-09-19",
"2020-02-10", "2021-03-10", "2021-05-02", "2016-11-02", "2017-09-12"
), column_1 = c(7.5, NA, NA, 10, NA, NA, NA, 0.5), column_2 = c(NA,
NA, 8L, NA, NA, 3L, 2L, 3L), column_3 = c(NA, 9, 10, NA, 7.3,
NA, NA, NA)), class = "data.frame", row.names = c(NA, -8L))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With