Faster equivalent to group_by %% expand in R

Question

I am trying to create a sequence of years for multiple IDs in R. My input table has a single row for each ID, and gives a Start_year. It looks like this:

ID    Start_year
01          1999
02          2004
03          2015
04          2007

etc...

I need to create a table with multiple rows for each ID, showing each year from their Start_year up to 2015. I will then use this to join to another table. So in my example, ID1 would have 17 rows with the years 1999:2015. ID2 would have 12 rows 2004:2015, ID3 would have 1 row 2015, and ID4 would have 9 rows 2007:2015.

For a subset of my data I can get this to work using the following code:

df %>% group_by(ID) %>% expand(year = Start_year:2015, Start_year) %>% select(-Start_year)

However, my full dataset has about 5 million IDs, and this command seems to be extremely slow, taking many hours.

I'm therefore looking for a faster implementation of this command in R. In my experience, data.table commands often seem to be faster than dplyr/tidyr - however, I am quite unfamiliar with data.table syntax.

markus · Accepted Answer

You could do

out <- DT[, .(col = seq.int(Start_year, 2015L)), by = ID]
out
#    ID  col
# 1:  1 1999
# 2:  1 2000
# 3:  1 2001
# 4:  1 2002
# 5:  1 2003
# 6:  1 2004
# 7:  1 2005
# 8:  1 2006
# 9:  1 2007
# ...

In your case you would probably need to do

setDT(df)[, .(col = seq.int(Start_year, 2015L)), by = ID]

A tidyverse way of the same idea

library(readr); library(dplyr); library(tidyr)
tbl <- read_table(text)

tbl %>% 
  group_by(ID) %>% 
  mutate(Start_year = list(seq.int(Start_year, 2015L))) %>%
  # rename(new_col = Start_year)
  unnest()

data

text <- "ID    Start_year
01          1999
02          2004
03          2015
04          2007"

library(data.table)
DT <- fread(text)

Frank · Answer

If you have enough memory, you could take full set of IDs x years and filter with a rolling join:

res <- DT[
  CJ(ID, Start_year = seq.int(min(Start_year), 2015L)), 
  on=.(ID, Start_year), 
  roll=TRUE, 
  nomatch=0
]

setnames(res, "Start_year", "Year")[]

CJ takes the "cross join" of the vector of IDs and years. If you are not on the latest version of data.table, you may need to name both arguments (ie, CJ(ID = ID, Start_year = seq.int(min(Start_year), 2015L))).

Comment. The OP says @markus' approach already brings the operation down to seconds, so maybe further improvement is not needed... Also, I'm not really sure that there are any circumstances under which my approach would be faster.

Faster equivalent to group_by %>% expand in R

Tags:

performance

r

data-manipulation

data.table

dplyr

rw2

2 Answers

markus

Frank

Recent Activity

Donate For Us