I have the following matching problem: I have two data.frames, one with an observation every month (per company ID), and one with an observation every quarter (per company ID; note that quarter means fiscal quarter; therefore 1Q = Jan, Feb, Mar is not necessarily correct and also, a fiscal quarter is not necessarily 3 month long).
For every month and company, I want to get the correct value of that quarter. Consequently, several months have the same value for one quarter. As an example see the code below:
monthlyData <- data.frame(ID = rep(c("A", "B"), each = 5),
Month = rep(1:5, times = 2),
MonValue = 1:10)
monthlyData
ID Month MonValue
1 A 1 1
2 A 2 2
3 A 3 3
4 A 4 4
5 A 5 5
6 B 1 6
7 B 2 7
8 B 3 8
9 B 4 9
10 B 5 10
#Quarterly data, i.e. the value of every quarter has to be matched to several months in d1
#However, I want to match fiscal quarters, which means that one quarter is not necessarily 3 month long
qtrData <- data.frame(ID = rep(c("A", "B"), each = 2),
startMonth = c(1, 4, 1, 3),
endMonth = c(3, 5, 2, 5),
QTRValue = 1:4)
qtrData
ID startMonth endMonth QTRValue
1 A 1 3 1
2 A 4 5 2
3 B 1 2 3
4 B 3 5 4
#Desired output
ID Month MonValue QTRValue
1 A 1 1 1
2 A 2 2 1
3 A 3 3 1
4 A 4 4 2
5 A 5 5 2
6 B 1 6 3
7 B 2 7 3
8 B 3 8 4
9 B 4 9 4
10 B 5 10 4
Note: This question was posted on R-help months ago, but I didn't get any answer then and found a solution myself (see R-help). Now, however, I posted a question on stackoverflow where I have a question regarding the data.table
where this problem was mentioned as well and there, Andrie asked me to post this question again because he apparently has a good solution for it (see Question on SO)
UPDATE: See Matthew Dowle's comment: how does the real data look?
This data is a more realistic one. I added a few rows, but the only main part that changed is column endMonth
in qtrData
. More precisely, the startMonth
is not necessarily the endMonth
of the previous quarter plus one month anymore. Therefore, using the roll
option, I think that you need another line of code (if not, you get 20 rows back, but with Andrie's solution, which is the desired one, you get 17 rows back). Then there is no performance difference anymore, if I don't miss anything here.
monthlyData_new <- data.table(ID = rep(c("A", "B"), each = 10),
Month = rep(1:10, times = 2),
MonValue = 1:20)
qtrData_new <- data.table(ID = rep(c("A", "B"), each = 3),
startMonth = c(1, 4, 7, 1, 3, 8),
endMonth = c(3, 5, 10, 2, 5, 10),
QTRValue = 1:6)
setkey(qtrData_new, ID)
setkey(monthlyData_new, ID)
qtrData1 <- qtrData_new
setkey(qtrData1, ID, startMonth)
monthlyData1 <- monthlyData_new
setkey(monthlyData1, ID, Month)
withTable1 <- function(){
xx <- qtrData1[monthlyData1, roll=TRUE]
xx <- xx[startMonth <= endMonth]
}
withTable2 <- function(){
yy <- monthlyData_new[qtrData_new][Month >= startMonth & Month <= endMonth]
}
benchmark(withTable1, withTable2, replications=1e6)
test replications elapsed relative user.self sys.self user.child sys.child
1 withTable1 1000000 4.244 1.028599 4.232 0.008 0 0
2 withTable2 1000000 4.126 1.000000 4.096 0.028 0 0
Try this :
mD = data.table(monthlyData, key="ID,Month")
qD = data.table(qtrData,key="ID,startMonth")
qD[mD,roll=TRUE]
ID startMonth endMonth QTRValue MonValue
[1,] A 1 3 1 1
[2,] A 2 3 1 2
[3,] A 3 3 1 3
[4,] A 4 5 2 4
[5,] A 5 5 2 5
[6,] B 1 2 3 6
[7,] B 2 2 3 7
[8,] B 3 5 4 8
[9,] B 4 5 4 9
[10,] B 5 5 4 10
That should be much faster.
EDIT : Answering the follow-up edit in question. One way is use NA to store where the missing months are. I find it easier to look at one time series column (irregular with gaps and NA), than two making a series of ranges.
> mD <- data.table(ID = rep(c("A", "B"), each = 10),
+ Month = rep(1:10, times = 2),
+ MonValue = 1:20, key="ID,Month")
>
> qD <- data.table(ID = rep(c("A", "B"), each = 4),
+ Month = c(1,4,6,7, 1,3,6,8),
+ QtrValue = c(1,2,NA,3, 4,5,NA,6),
+ key="ID,Month")
>
> mD
ID Month MonValue
[1,] A 1 1
[2,] A 2 2
[3,] A 3 3
[4,] A 4 4
[5,] A 5 5
[6,] A 6 6
[7,] A 7 7
[8,] A 8 8
[9,] A 9 9
[10,] A 10 10
[11,] B 1 11
[12,] B 2 12
[13,] B 3 13
[14,] B 4 14
[15,] B 5 15
[16,] B 6 16
[17,] B 7 17
[18,] B 8 18
[19,] B 9 19
[20,] B 10 20
> qD
ID Month QtrValue
[1,] A 1 1
[2,] A 4 2
[3,] A 6 NA # missing for 1 month (6)
[4,] A 7 3
[5,] B 1 4
[6,] B 3 5
[7,] B 6 NA # missing for 2 months (6 and 7)
[8,] B 8 6
> qD[mD,roll=TRUE]
ID Month QtrValue MonValue
[1,] A 1 1 1
[2,] A 2 1 2
[3,] A 3 1 3
[4,] A 4 2 4
[5,] A 5 2 5
[6,] A 6 NA 6
[7,] A 7 3 7
[8,] A 8 3 8
[9,] A 9 3 9
[10,] A 10 3 10
[11,] B 1 4 11
[12,] B 2 4 12
[13,] B 3 5 13
[14,] B 4 5 14
[15,] B 5 5 15
[16,] B 6 NA 16
[17,] B 7 NA 17
[18,] B 8 6 18
[19,] B 9 6 19
[20,] B 10 6 20
> qD[mD,roll=TRUE][!is.na(QtrValue)]
ID Month QtrValue MonValue
[1,] A 1 1 1
[2,] A 2 1 2
[3,] A 3 1 3
[4,] A 4 2 4
[5,] A 5 2 5
[6,] A 7 3 7
[7,] A 8 3 8
[8,] A 9 3 9
[9,] A 10 3 10
[10,] B 1 4 11
[11,] B 2 4 12
[12,] B 3 5 13
[13,] B 4 5 14
[14,] B 5 5 15
[15,] B 8 6 18
[16,] B 9 6 19
[17,] B 10 6 20
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With