Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to best join one column of a data.table with another column of the same data.table?

Tags:

r

data.table

My data

I have a data.table DT with the current (F0YR) and the next (F1YR) fiscal year-end (FYE) encoded as integers. Since every next FYE will eventually become a current FYE, the integer will be both in the column F1YR and F0YR. Also, my data contains monthly observations so the same FYE will be in the data set multiple times:

library(data.table)
DT <- data.table(ID     = rep(c("A", "B"), each=9),
                 MONTH  = rep(100L:108L, times=2),
                 F0YR   = rep(c(1L, 4L, 7L), each=3, times=2),
                 F1YR   = rep(c(4L, 7L, 9L), each=3, times=2),
                 value  = c(rep(1:5, each=3), 6, 6, 7),
                 key    = "ID,F0YR")
DT
      ID MONTH F0YR F1YR value
 [1,]  A   100    1    4     1
 [2,]  A   101    1    4     1
 [3,]  A   102    1    4     1
 [4,]  A   103    4    7     2
 [5,]  A   104    4    7     2
 [6,]  A   105    4    7     2
 [7,]  A   106    7    9     3
 [8,]  A   107    7    9     3
 [9,]  A   108    7    9     3
[10,]  B   100    1    4     4
[11,]  B   101    1    4     4
...

What I want to do

For every ID and F1YR combination, I want to get the value for the ID and F0YR combination. As an example: Company A had a value of 2 for FOYR==4. Now, I want an additional column for all combinations with ID=="A" and F1YR==4 which is set to 2, next to the already existent value of 1.

What I tried

intDT <- DT[CJ(unique(ID), unique(F0YR)), list(ID, F0YR, valueNew = value), mult="last"]
setkey(intDT, ID, F0YR)
setkey(DT, ID, F1YR)
DT <- intDT[DT]
setnames(DT, c("F0YR.1", "F0YR"), c("F0YR", "F1YR"))
DT
      ID F1YR valueNew MONTH F0YR value
 [1,]  A    4        2   100    1     1
 [2,]  A    4        2   101    1     1
 [3,]  A    4        2   102    1     1
 [4,]  A    7        3   103    4     2
 [5,]  A    7        3   104    4     2
 [6,]  A    7        3   105    4     2
 [7,]  A    9       NA   106    7     3
 [8,]  A    9       NA   107    7     3
 [9,]  A    9       NA   108    7     3
[10,]  B    4        5   100    1     4
[11,]  B    4        5   101    1     4
...

(Note that I use mult="last" here because, although the values should only change with F0YR or F1YR changes, sometimes they don't and this is just my tie breaker).

What I want

This looks improvable. First of all, I have to make a copy of my DT. Second, since I join basically the same data.table, all the column names have the same name and I have to rename them. I thought that a self join would be the way forward, but I tried and tried and couldn't get a nice solution. I have the hope that there is something easy out there which I just don't see...Does anyone have a clue? Or is my data set up in such a way that it is actually hard (maybe because I have monthly observations, but want to join only quarterly or yearly changing values).

like image 446
Christoph_J Avatar asked Jun 14 '12 09:06

Christoph_J


People also ask

How do you merge cells with the same data?

Merge cellsClick the first cell and press Shift while you click the last cell in the range you want to merge. Important: Make sure only one of the cells in the range has data. Click Home > Merge & Center.


1 Answers

In use cases like this, the mantra "aggregate first, then join with that" often helps. So, starting with your DT, and using v1.8.1 :

> agg = DT[,last(value),by=list(ID,F0YR)]
> agg
   ID F0YR V1
1:  A    1  1
2:  A    4  2
3:  A    7  3
4:  B    1  4
5:  B    4  5
6:  B    7  7

I called it agg because I couldn't think of a better name. In this case you wanted last which isn't really an aggregate as such, but you know what I mean.

Then update DT by reference by group. Here we're grouping by i.

setkey(DT,ID,F1YR)
DT[agg,newcol:=V1]
    ID MONTH F0YR F1YR value newcol
 1:  A   100    1    4     1      2
 2:  A   101    1    4     1      2
 3:  A   102    1    4     1      2
 4:  A   103    4    7     2      3
 5:  A   104    4    7     2      3
 6:  A   105    4    7     2      3
 7:  A   106    7    9     3     NA
 8:  A   107    7    9     3     NA
 9:  A   108    7    9     3     NA
10:  B   100    1    4     4      5
11:  B   101    1    4     4      5
12:  B   102    1    4     4      5
13:  B   103    4    7     5      7
14:  B   104    4    7     5      7
15:  B   105    4    7     5      7
16:  B   106    7    9     6     NA
17:  B   107    7    9     6     NA
18:  B   108    7    9     7     NA

Is that right? Not sure I fully followed. Those ops should be very fast, without any copies, and should scale to large data. At least, that's the intention.

like image 155
Matt Dowle Avatar answered Oct 20 '22 13:10

Matt Dowle