I am trying to create a sequence of years for multiple IDs in R. My input table has a single row for each ID, and gives a Start_year. It looks like this:
ID Start_year
01 1999
02 2004
03 2015
04 2007
etc...
I need to create a table with multiple rows for each ID, showing each year from their Start_year up to 2015. I will then use this to join to another table. So in my example, ID1 would have 17 rows with the years 1999:2015. ID2 would have 12 rows 2004:2015, ID3 would have 1 row 2015, and ID4 would have 9 rows 2007:2015.
For a subset of my data I can get this to work using the following code:
df %>% group_by(ID) %>% expand(year = Start_year:2015, Start_year) %>% select(-Start_year)
However, my full dataset has about 5 million IDs, and this command seems to be extremely slow, taking many hours.
I'm therefore looking for a faster implementation of this command in R. In my experience, data.table commands often seem to be faster than dplyr/tidyr - however, I am quite unfamiliar with data.table syntax.
You could do
out <- DT[, .(col = seq.int(Start_year, 2015L)), by = ID]
out
# ID col
# 1: 1 1999
# 2: 1 2000
# 3: 1 2001
# 4: 1 2002
# 5: 1 2003
# 6: 1 2004
# 7: 1 2005
# 8: 1 2006
# 9: 1 2007
# ...
In your case you would probably need to do
setDT(df)[, .(col = seq.int(Start_year, 2015L)), by = ID]
A tidyverse
way of the same idea
library(readr); library(dplyr); library(tidyr)
tbl <- read_table(text)
tbl %>%
group_by(ID) %>%
mutate(Start_year = list(seq.int(Start_year, 2015L))) %>%
# rename(new_col = Start_year)
unnest()
data
text <- "ID Start_year
01 1999
02 2004
03 2015
04 2007"
library(data.table)
DT <- fread(text)
If you have enough memory, you could take full set of IDs x years and filter with a rolling join:
res <- DT[
CJ(ID, Start_year = seq.int(min(Start_year), 2015L)),
on=.(ID, Start_year),
roll=TRUE,
nomatch=0
]
setnames(res, "Start_year", "Year")[]
CJ
takes the "cross join" of the vector of IDs and years. If you are not on the latest version of data.table, you may need to name both arguments (ie, CJ(ID = ID, Start_year = seq.int(min(Start_year), 2015L))
).
Comment. The OP says @markus' approach already brings the operation down to seconds, so maybe further improvement is not needed... Also, I'm not really sure that there are any circumstances under which my approach would be faster.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With