
Data Table merge based on date ranges

Tags: r, data.table

I have two tables, policies and claims:

policies<-data.table(policyNumber=c(123,123,124,125), 
                EFDT=as.Date(c("2012-1-1","2013-1-1","2013-1-1","2013-2-1")), 
                EXDT=as.Date(c("2013-1-1","2014-1-1","2014-1-1","2014-2-1")))
> policies
   policyNumber       EFDT       EXDT
1:          123 2012-01-01 2013-01-01
2:          123 2013-01-01 2014-01-01
3:          124 2013-01-01 2014-01-01
4:          125 2013-02-01 2014-02-01


claims<-data.table(claimNumber=c(1,2,3,4), 
                   policyNumber=c(123,123,123,124),
                   lossDate=as.Date(c("2012-2-1","2012-8-15","2013-1-1","2013-10-31")),
                   claimAmount=c(10,20,20,15))
> claims
   claimNumber policyNumber   lossDate claimAmount
1:           1          123 2012-02-01          10
2:           2          123 2012-08-15          20
3:           3          123 2013-01-01          20
4:           4          124 2013-10-31          15

The policy table really contains policy-terms, since each row is uniquely identified by a policy number along with an effective date.

I want to merge the two tables in a way that associates claims with policy-terms. A claim is associated with a policy term if it has the same policy number and the lossDate of the claim falls within the effective date and expiration date of the policy-term (effective dates are inclusive bounds and expiration dates are exclusive bounds.) How do I merge the tables in this way?

This should be similar to a left outer join. The result should look like

   policyNumber       EFDT       EXDT claimNumber   lossDate claimAmount
1:          123 2012-01-01 2013-01-01           1 2012-02-01          10
2:          123 2012-01-01 2013-01-01           2 2012-08-15          20
3:          123 2013-01-01 2014-01-01           3 2013-01-01          20
4:          124 2013-01-01 2014-01-01           4 2013-10-31          15
5:          125 2013-02-01 2014-02-01          NA       <NA>          NA
Ben asked Feb 04 '14 18:02


2 Answers

Version 1 (updated for data.table v1.9.4+)

Try this:

# Policies table; I've added policyNumber 126:
policies<-data.table(policyNumber=c(123,123,124,125,126), 
                     EFDT=as.Date(c("2012-01-01","2013-01-01","2013-01-01","2013-02-01","2013-02-01")), 
                     EXDT=as.Date(c("2013-01-01","2014-01-01","2014-01-01","2014-02-01","2014-02-01")))

# Claims table; I've added two claims for 126 that are before and after the policy dates:
claims<-data.table(claimNumber=c(1,2,3,4,5,6), 
                   policyNumber=c(123,123,123,124,126,126),
                   lossDate=as.Date(c("2012-2-1","2012-8-15","2013-1-1","2013-10-31","2012-06-01","2014-03-01")),
                   claimAmount=c(10,20,20,15,5,25))

# Set the keys for policies and claims so we can join them:
setkey(policies,policyNumber,EFDT)
setkey(claims,policyNumber,lossDate)

# Join the tables using roll
# ans<-policies[claims,list(EFDT,EXDT,claimNumber,lossDate,claimAmount,inPolicy=F),roll=T][,EFDT:=NULL] ## This worked with earlier versions of data.table, but broke when they updated the by-without-by behavior...
ans<-policies[claims,list(.EFDT=EFDT,EXDT,claimNumber,lossDate,claimAmount,inPolicy=F),by=.EACHI,roll=T][,`:=`(EFDT=.EFDT, .EFDT=NULL)]

# The claim should have inPolicy==T where EFDT <= lossDate < EXDT (EXDT is an exclusive bound):
ans[lossDate>=EFDT & lossDate<EXDT, inPolicy:=T]

# Set the keys again, but this time we'll join on both dates:
setkey(ans,policyNumber,EFDT,EXDT)
setkey(policies,policyNumber,EFDT,EXDT)

# Union the ans table with policies that don't have any claims:
ans<-rbindlist(list(ans, ans[policies][is.na(claimNumber)]))

ans
#   policyNumber       EFDT       EXDT claimNumber   lossDate claimAmount inPolicy
#1:          123 2012-01-01 2013-01-01           1 2012-02-01          10     TRUE
#2:          123 2012-01-01 2013-01-01           2 2012-08-15          20     TRUE
#3:          123 2013-01-01 2014-01-01           3 2013-01-01          20     TRUE
#4:          124 2013-01-01 2014-01-01           4 2013-10-31          15     TRUE
#5:          126       <NA>       <NA>           5 2012-06-01           5    FALSE
#6:          126 2013-02-01 2014-02-01           6 2014-03-01          25    FALSE
#7:          125 2013-02-01 2014-02-01          NA       <NA>          NA       NA
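
To see what roll=T contributes in isolation, here is a minimal sketch on toy data. The tables terms and loss and the term column are made up for illustration and are not part of the data above. In a keyed join with roll=TRUE, a lossDate with no exact EFDT match picks up the most recent earlier EFDT for the same policyNumber:

```r
library(data.table)

# Hypothetical toy data: one policy with two consecutive terms
terms <- data.table(policyNumber = 123,
                    EFDT = as.Date(c("2012-01-01", "2013-01-01")),
                    term = c("first", "second"))
loss <- data.table(policyNumber = 123,
                   lossDate = as.Date(c("2012-08-15", "2013-01-01", "2013-06-30")))

setkey(terms, policyNumber, EFDT)
setkey(loss, policyNumber, lossDate)

# roll applies to the last key column (EFDT): exact matches are used as-is,
# otherwise the previous EFDT is carried forward to the lossDate
matched <- terms[loss, roll = TRUE]
matched$term
# "first" "second" "second"
```

Rolling only finds the term that started most recently; it does not check the expiration date, which is why the answer adds the inPolicy test afterwards.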

Version 2

@Arun suggested using the new foverlaps function from data.table. My attempt below seems harder, not easier, so please let me know how to improve it.

## The foverlaps function requires both tables to have a start and end range, and the "y" table to be keyed
claims[, lossDate2:=lossDate]  ## Add a redundant lossDate column to use as the end range for claims
setkey(policies, policyNumber, EFDT, EXDT) ## Set the key for policies ("y" table)

## Find the overlaps, remove the redundant lossDate2 column, and add the inPolicy column:
ans2 <- foverlaps(claims, policies, by.x=c("policyNumber", "lossDate", "lossDate2"))[, `:=`(inPolicy=T, lossDate2=NULL)]

## Update rows where the claim was out of policy:
ans2[is.na(EFDT), inPolicy:=F]

## Remove duplicates (such as policyNumber==123 & claimNumber==3, whose lossDate
##   matches two terms because foverlaps treats both interval ends as inclusive),
##   and add policies with no claims (policyNumber==125):
setkey(ans2, policyNumber, claimNumber, lossDate, EFDT) ## order the results
setkey(ans2, policyNumber, claimNumber) ## set the key to identify unique values
ans2 <- rbindlist(list(
  unique(ans2, by=key(ans2)), ## keep the first (earliest EFDT) row per key; since v1.9.8 unique() no longer defaults to the key
  policies[!.(ans2[, unique(policyNumber)])] ## policies with no claims
), fill=T)

ans2
##    policyNumber       EFDT       EXDT claimNumber   lossDate claimAmount inPolicy
## 1:          123 2012-01-01 2013-01-01           1 2012-02-01          10     TRUE
## 2:          123 2012-01-01 2013-01-01           2 2012-08-15          20     TRUE
## 3:          123 2012-01-01 2013-01-01           3 2013-01-01          20     TRUE
## 4:          124 2013-01-01 2014-01-01           4 2013-10-31          15     TRUE
## 5:          126       <NA>       <NA>           5 2012-06-01           5    FALSE
## 6:          126       <NA>       <NA>           6 2014-03-01          25    FALSE
## 7:          125 2013-02-01 2014-02-01          NA       <NA>          NA       NA

Version 3

Using foverlaps(), another version:

require(data.table) ## 1.9.4+
setDT(claims)[, lossDate2 := lossDate]
setDT(policies)[, EXDTclosed := EXDT-1L]
setkey(claims, policyNumber, lossDate, lossDate2)
foverlaps(policies, claims, by.x=c("policyNumber", "EFDT", "EXDTclosed"))

foverlaps() requires a start and an end column for each interval, so we copy the lossDate column into lossDate2.

Since EXDT is an exclusive (open) bound while foverlaps() treats both ends of an interval as inclusive, we subtract one day from it and store the result in a new column, EXDTclosed.
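
A quick sanity check of the subtract-one-day step, as a sketch that assumes all dates are whole days (true for Date values built with as.Date from date strings). The variable names d, half_open and closed are illustrative only:

```r
EFDT <- as.Date("2012-01-01")
EXDT <- as.Date("2013-01-01")

# One date on the last covered day, one on the (exclusive) expiration date
d <- as.Date(c("2012-12-31", "2013-01-01"))

half_open <- d >= EFDT & d < EXDT        # EFDT <= d <  EXDT
closed    <- d >= EFDT & d <= EXDT - 1L  # EFDT <= d <= EXDT - 1 day

identical(half_open, closed)
# TRUE
```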

Now we set the key. foverlaps() requires the last two key columns to be the interval start and end, so they are specified last. We also want the overlap join to match on policyNumber first, so it goes at the front of the key.

The key must be set on claims, the y argument (check ?foverlaps). Setting a key on policies is optional; if you do, you can skip the by.x argument, which defaults to the key of x. Since we don't set a key on policies here, we name the corresponding columns explicitly in by.x. The default overlap type, any, is what we want, so it is not specified. This results in:

#    policyNumber claimNumber   lossDate claimAmount  lossDate2       EFDT       EXDT EXDTclosed
# 1:          123           1 2012-02-01          10 2012-02-01 2012-01-01 2013-01-01 2012-12-31
# 2:          123           2 2012-08-15          20 2012-08-15 2012-01-01 2013-01-01 2012-12-31
# 3:          123           3 2013-01-01          20 2013-01-01 2013-01-01 2014-01-01 2013-12-31
# 4:          124           4 2013-10-31          15 2013-10-31 2013-01-01 2014-01-01 2013-12-31
# 5:          125          NA       <NA>          NA       <NA> 2013-02-01 2014-02-01 2014-01-31
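
As a possible alternative to the versions above, newer releases of data.table (v1.9.8 and later, which is an assumption about your installed version) support non-equi joins through the on= argument, so the inclusive and exclusive bounds can be written directly. This is a sketch on the question's original data, not a replacement for the answers above; res is an illustrative name:

```r
library(data.table)

policies <- data.table(
  policyNumber = c(123, 123, 124, 125),
  EFDT = as.Date(c("2012-01-01", "2013-01-01", "2013-01-01", "2013-02-01")),
  EXDT = as.Date(c("2013-01-01", "2014-01-01", "2014-01-01", "2014-02-01")))

claims <- data.table(
  claimNumber = c(1, 2, 3, 4),
  policyNumber = c(123, 123, 123, 124),
  lossDate = as.Date(c("2012-02-01", "2012-08-15", "2013-01-01", "2013-10-31")),
  claimAmount = c(10, 20, 20, 15))

# For every policy term (i), find claims (x) with the same policyNumber and
# EFDT <= lossDate < EXDT. Unmatched terms are kept with NA claim columns.
# The x. and i. prefixes pick the original columns, because the join columns
# in a non-equi join result otherwise carry the values from i.
res <- claims[policies,
              on = .(policyNumber, lossDate >= EFDT, lossDate < EXDT),
              .(policyNumber = i.policyNumber, EFDT = i.EFDT, EXDT = i.EXDT,
                claimNumber, lossDate = x.lossDate, claimAmount)]
res
```

With these bounds, claim 3 (lossDate 2013-01-01) should land in the term starting 2013-01-01, and policy 125 should appear once with NA claim columns, as requested in the question.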
dnlbrky answered Nov 03 '22 10:11


I think this does mostly what you want. I need to run so don't have time to add the policy with no claims and clean the columns up, but I think the difficult issues are addressed:

setkey(policies, policyNumber, EXDT)
policies[, EXDT2:=EXDT]
policies[claims[, list( policyNumber, lossDate, lossDate, claimNumber, claimAmount)], roll=-Inf]
#    policyNumber       EXDT       EFDT      EXDT2   lossDate claimNumber claimAmount
# 1:          123 2012-02-01 2012-01-01 2013-01-01 2012-02-01           1          10
# 2:          123 2012-08-15 2012-01-01 2013-01-01 2012-08-15           2          20
# 3:          123 2013-01-01 2012-01-01 2013-01-01 2013-01-01           3          20
# 4:          124 2013-10-31 2013-01-01 2014-01-01 2013-10-31           4          15

Also, note it is trivial to remove/highlight claims outside of policy dates from this result.
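
One way to do that flagging, as a sketch: the table res below is typed in by hand from the output shown above (it is not produced by re-running the join), and inPolicy is an illustrative column name. The test assumes the question's convention that the expiration date is exclusive:

```r
library(data.table)

# Relevant columns of the rolled result, copied from the output above
res <- data.table(
  policyNumber = c(123, 123, 123, 124),
  EFDT  = as.Date(c("2012-01-01", "2012-01-01", "2012-01-01", "2013-01-01")),
  EXDT2 = as.Date(c("2013-01-01", "2013-01-01", "2013-01-01", "2014-01-01")),
  lossDate = as.Date(c("2012-02-01", "2012-08-15", "2013-01-01", "2013-10-31")),
  claimNumber = c(1, 2, 3, 4))

# Flag claims whose lossDate falls in [EFDT, EXDT2)
res[, inPolicy := lossDate >= EFDT & lossDate < EXDT2]
res$inPolicy
# TRUE TRUE FALSE TRUE
```

Claim 3 is flagged FALSE here because rolling on EXDT matched its lossDate exactly to the expiration date of the earlier term, so with an exclusive expiration bound it would still need to be reassigned to the term starting 2013-01-01.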

BrodieG answered Nov 03 '22 10:11