I have data for one patient in R, which shows dates when they tested positive for a certain condition. The data looks like this:
date positive
2005-02-22 yes
2005-04-26 no
2005-08-02 yes
2005-10-04 no
2005-12-06 yes
2006-03-14 no
2006-06-06 no
2006-09-12 yes
2006-12-19 yes
2007-03-27 yes
Now I introduce a new definition. The patient's condition is defined as "chronic positive" if "current test is positive, and >=50% of tests in the prior 365 days were positive". So I want to create an output dataset that tells me whether the patient was chronically positive at each date. For example, the output should look like this (e.g. on 2006-09-12, they are "positive" but not "chronic positive" because 3 out of 4 visits in the previous 365 days were negative):
date positive chronic
2005-02-22 yes no
2005-04-26 no no
2005-08-02 yes yes
2005-10-04 no no
2005-12-06 yes yes
2006-03-14 no no
2006-06-06 no no
2006-09-12 yes no
2006-12-19 yes no
2007-03-27 yes yes
How can I do this? At each row of interest, I need to be able to look at previous rows (within the last 365 days) and assess what proportion of them were positive. I think I could use a combination of the lead
/lag
functions and dplyr
, but I would appreciate an example of how this can be done.
The original data can be reproduced with:
dat <- structure(list(date = structure(c(12836, 12899, 12997, 13060, 13123, 13221, 13305, 13403, 13501, 13599), class = "Date"),
positive = c("yes", "no", "yes", "no", "yes", "no", "no", "yes", "yes", "yes")),
row.names = c(NA, 10L), class = "data.frame")
To use mutate in R, all you need to do is call the function, specify the dataframe, and specify the name-value pair for the new variable you want to create. The explanation I just gave is pretty straightforward, but to make it more concrete, let’s work with some actual data.
How to add a new column to a data frame using mutate in R? How to add a new column to a data frame using mutate in R? The mutate function of dplyr package in R can help us to add a new column to a data frame and the benefit of using mutate is that we can decide the position of the new column during the addition.
Source: R/manip.r. mutate.Rd. mutate() adds new variables and preserves existing ones; transmute() adds new variables and drops existing ones. Both functions preserve the number of rows of the input. New variables overwrite existing variables of the same name.
The mutate function of dplyr package in R can help us to add a new column to a data frame and the benefit of using mutate is that we can decide the position of the new column during the addition. For example, if we have a data frame called df that contains three columns say x, y, a then we can add a new column say z after y using mutate function.
You can make use slider
library for such rolling computation. Syntax explanation -
slide_index_lgl
works on a vector .x
and an index .i
simultaneously and produces a logical vector output..x
is used as positive
vector.i
is used as date
vector.before
and .after
are self explanatory (previous 365 days included and current day excluded).f
is simple where test positivity in previous 365 days is checkedpositive == 'yes'
I used this formula (sum(.x == 'yes') / length(.x)) >= 0.5
1
for FALSE
and 2
for TRUE
c('No', 'Yes') so that you'll get
Yesfor
TRUEand
Nofor
FALSE`library(tidyverse)
df <- read.table(header = TRUE, text = 'date positive
2005-02-22 yes
2005-04-26 no
2005-08-02 yes
2005-10-04 no
2005-12-06 yes
2006-03-14 no
2006-06-06 no
2006-09-12 yes
2006-12-19 yes
2007-03-27 yes')
df$date <- as.Date(df$date)
library(slider)
library(lubridate)
df %>%
mutate(chronic = c('No', "Yes")[1 + (positive == 'yes' & slide_index_lgl(positive, date,
~ (sum(.x == 'yes') / length(.x)) >= 0.5 ,
.before = days(365),
.after = days(-1)))])
#> date positive chronic
#> 1 2005-02-22 yes <NA>
#> 2 2005-04-26 no No
#> 3 2005-08-02 yes Yes
#> 4 2005-10-04 no No
#> 5 2005-12-06 yes Yes
#> 6 2006-03-14 no No
#> 7 2006-06-06 no No
#> 8 2006-09-12 yes No
#> 9 2006-12-19 yes No
#> 10 2007-03-27 yes Yes
Alternative strategy using runner::runner()
in baseR
dat <- structure(list(date = structure(c(12836, 12899, 12997, 13060, 13123, 13221, 13305, 13403, 13501, 13599), class = "Date"),
positive = c("yes", "no", "yes", "no", "yes", "no", "no", "yes", "yes", "yes")),
row.names = c(NA, 10L), class = "data.frame")
library(runner)
dat$chronic <- ifelse(runner(dat$positive, idx = dat$date, lag = '1 day',
k = '365 days',
f = \(.x) (sum(.x == 'yes')/length(.x)) >= 0.5) & dat$positive == 'yes', 'yes', 'no')
dat
#> date positive chronic
#> 1 2005-02-22 yes <NA>
#> 2 2005-04-26 no no
#> 3 2005-08-02 yes yes
#> 4 2005-10-04 no no
#> 5 2005-12-06 yes yes
#> 6 2006-03-14 no no
#> 7 2006-06-06 no no
#> 8 2006-09-12 yes no
#> 9 2006-12-19 yes no
#> 10 2007-03-27 yes yes
Another option using non-equi join in data.table
:
library(data.table)
setDT(dat)[, yrago := date - 365L]
dat[, chronic := fifelse(
.SD[.SD, on=.(date>=yrago, date<date),
by=.EACHI, .N>0 & i.positive=="yes" & sum(x.positive=="yes")/.N >= 0.5]$V1,
"yes", "no")
]
dat[, yrago := NULL][]
output:
date positive chronic
1: 2005-02-22 yes no
2: 2005-04-26 no no
3: 2005-08-02 yes yes
4: 2005-10-04 no no
5: 2005-12-06 yes yes
6: 2006-03-14 no no
7: 2006-06-06 no no
8: 2006-09-12 yes no
9: 2006-12-19 yes no
10: 2007-03-27 yes yes
You can also use this solution in case you don't want to use rolling functions:
library(dplyr)
library(purrr)
library(lubridate)
map(df %>%
filter(positive == "yes") %>%
pull(date), ~ df %>% filter(date %within% interval(.x - days(365), .x))) %>%
map_dfr(~ .x %>%
summarise(date = last(date),
chronic = (sum(positive == "yes")-1)/ (n()-1) >= 0.5)) %>%
right_join(df, by = "date") %>%
arrange(date) %>%
mutate(chronic = if_else(is.na(chronic) | !chronic, "no", "yes"))
# A tibble: 10 x 3
date chronic positive
<chr> <chr> <chr>
1 2005-02-22 no yes
2 2005-04-26 no no
3 2005-08-02 yes yes
4 2005-10-04 no no
5 2005-12-06 yes yes
6 2006-03-14 no no
7 2006-06-06 no no
8 2006-09-12 no yes
9 2006-12-19 no yes
10 2007-03-27 yes yes
Here is one way -
library(dplyr)
library(purrr)
dat %>%
mutate(chronic = map_chr(row_number(), ~{
inds <- between(date, date[.x] - 365, date[.x] - 1)
if(positive[.x] == "yes" && any(inds) && mean(positive[inds] == 'yes') >= 0.5) 'yes' else 'no'
}))
# date positive chronic
#1 2005-02-22 yes no
#2 2005-04-26 no no
#3 2005-08-02 yes yes
#4 2005-10-04 no no
#5 2005-12-06 yes yes
#6 2006-03-14 no no
#7 2006-06-06 no no
#8 2006-09-12 yes no
#9 2006-12-19 yes no
#10 2007-03-27 yes yes
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With