<h3>My data</h3> <p>I have a data.table <code>DT</code> with the current (<code>F0YR</code>) and the next (<code>F1YR</code>) fiscal year-end (FYE) encoded as integers. Since every next FYE will eventually become a current FYE, the integer will be both in the column <code>F1YR</code> and <code>F0YR</code>. Also, my data contains monthly observations so the same FYE will be in the data set multiple times:</p> <pre class="prettyprint"><code>library(data.table) DT <- data.table(ID = rep(c("A", "B"), each=9), MONTH = rep(100L:108L, times=2), F0YR = rep(c(1L, 4L, 7L), each=3, times=2), F1YR = rep(c(4L, 7L, 9L), each=3, times=2), value = c(rep(1:5, each=3), 6, 6, 7), key = "ID,F0YR") DT ID MONTH F0YR F1YR value [1,] A 100 1 4 1 [2,] A 101 1 4 1 [3,] A 102 1 4 1 [4,] A 103 4 7 2 [5,] A 104 4 7 2 [6,] A 105 4 7 2 [7,] A 106 7 9 3 [8,] A 107 7 9 3 [9,] A 108 7 9 3 [10,] B 100 1 4 4 [11,] B 101 1 4 4 ... </code></pre> <h3>What I want to do</h3> <p>For every <code>ID</code> and <code>F1YR</code> combination, I want to get the value for the <code>ID</code> and <code>F0YR</code> combination. As an example: Company A had a value of <code>2</code> for <code>FOYR==4</code>. Now, I want an additional column for all combinations with <code>ID=="A"</code> and <code>F1YR==4</code> which is set to 2, next to the already existent value of 1.</p> <h3>What I tried</h3> <pre class="prettyprint"><code>intDT <- DT[CJ(unique(ID), unique(F0YR)), list(ID, F0YR, valueNew = value), mult="last"] setkey(intDT, ID, F0YR) setkey(DT, ID, F1YR) DT <- intDT[DT] setnames(DT, c("F0YR.1", "F0YR"), c("F0YR", "F1YR")) DT ID F1YR valueNew MONTH F0YR value [1,] A 4 2 100 1 1 [2,] A 4 2 101 1 1 [3,] A 4 2 102 1 1 [4,] A 7 3 103 4 2 [5,] A 7 3 104 4 2 [6,] A 7 3 105 4 2 [7,] A 9 NA 106 7 3 [8,] A 9 NA 107 7 3 [9,] A 9 NA 108 7 3 [10,] B 4 5 100 1 4 [11,] B 4 5 101 1 4 ... </code></pre> <p>(Note that I use <code>mult="last"</code> here because, although the values should only change with F0YR or F1YR changes, sometimes they don't and this is just my tie breaker).</p> <h3>What I want</h3> <p>This looks improvable. First of all, I have to make a copy of my DT. Second, since I join basically the same <code>data.table</code>, all the column names have the same name and I have to rename them. I thought that a <code>self join</code> would be the way forward, but I tried and tried and couldn't get a nice solution. I have the hope that there is something easy out there which I just don't see...Does anyone have a clue? Or is my data set up in such a way that it is actually hard (maybe because I have monthly observations, but want to join only quarterly or yearly changing values).</p>

<p>In use cases like this, the mantra "aggregate first, then join with that" often helps. So, starting with your <code>DT</code>, and using v1.8.1 :</p> <pre class="prettyprint"><code>> agg = DT[,last(value),by=list(ID,F0YR)] > agg ID F0YR V1 1: A 1 1 2: A 4 2 3: A 7 3 4: B 1 4 5: B 4 5 6: B 7 7 </code></pre> <p>I called it <code>agg</code> because I couldn't think of a better name. In this case you wanted <code>last</code> which isn't really an aggregate as such, but you know what I mean.</p> <p>Then update <code>DT</code> by reference by group. Here we're grouping by <code>i</code>.</p> <pre class="prettyprint"><code>setkey(DT,ID,F1YR) DT[agg,newcol:=V1] ID MONTH F0YR F1YR value newcol 1: A 100 1 4 1 2 2: A 101 1 4 1 2 3: A 102 1 4 1 2 4: A 103 4 7 2 3 5: A 104 4 7 2 3 6: A 105 4 7 2 3 7: A 106 7 9 3 NA 8: A 107 7 9 3 NA 9: A 108 7 9 3 NA 10: B 100 1 4 4 5 11: B 101 1 4 4 5 12: B 102 1 4 4 5 13: B 103 4 7 5 7 14: B 104 4 7 5 7 15: B 105 4 7 5 7 16: B 106 7 9 6 NA 17: B 107 7 9 6 NA 18: B 108 7 9 7 NA </code></pre> <p>Is that right? Not sure I fully followed. Those ops should be very fast, without any copies, and should scale to large data. At least, that's the intention.</p>

How to best join one column of a data.table with another column of the same data.table?

My data

I have a data.table DT with the current (F0YR) and the next (F1YR) fiscal year-end (FYE) encoded as integers. Since every next FYE will eventually become a current FYE, the integer will be both in the column F1YR and F0YR. Also, my data contains monthly observations so the same FYE will be in the data set multiple times:

library(data.table)
DT <- data.table(ID     = rep(c("A", "B"), each=9),
                 MONTH  = rep(100L:108L, times=2),
                 F0YR   = rep(c(1L, 4L, 7L), each=3, times=2),
                 F1YR   = rep(c(4L, 7L, 9L), each=3, times=2),
                 value  = c(rep(1:5, each=3), 6, 6, 7),
                 key    = "ID,F0YR")
DT
      ID MONTH F0YR F1YR value
 [1,]  A   100    1    4     1
 [2,]  A   101    1    4     1
 [3,]  A   102    1    4     1
 [4,]  A   103    4    7     2
 [5,]  A   104    4    7     2
 [6,]  A   105    4    7     2
 [7,]  A   106    7    9     3
 [8,]  A   107    7    9     3
 [9,]  A   108    7    9     3
[10,]  B   100    1    4     4
[11,]  B   101    1    4     4
...

What I want to do

For every ID and F1YR combination, I want to get the value for the ID and F0YR combination. As an example: Company A had a value of 2 for FOYR==4. Now, I want an additional column for all combinations with ID=="A" and F1YR==4 which is set to 2, next to the already existent value of 1.

What I tried

intDT <- DT[CJ(unique(ID), unique(F0YR)), list(ID, F0YR, valueNew = value), mult="last"]
setkey(intDT, ID, F0YR)
setkey(DT, ID, F1YR)
DT <- intDT[DT]
setnames(DT, c("F0YR.1", "F0YR"), c("F0YR", "F1YR"))
DT
      ID F1YR valueNew MONTH F0YR value
 [1,]  A    4        2   100    1     1
 [2,]  A    4        2   101    1     1
 [3,]  A    4        2   102    1     1
 [4,]  A    7        3   103    4     2
 [5,]  A    7        3   104    4     2
 [6,]  A    7        3   105    4     2
 [7,]  A    9       NA   106    7     3
 [8,]  A    9       NA   107    7     3
 [9,]  A    9       NA   108    7     3
[10,]  B    4        5   100    1     4
[11,]  B    4        5   101    1     4
...

(Note that I use mult="last" here because, although the values should only change with F0YR or F1YR changes, sometimes they don't and this is just my tie breaker).

What I want

This looks improvable. First of all, I have to make a copy of my DT. Second, since I join basically the same data.table, all the column names have the same name and I have to rename them. I thought that a self join would be the way forward, but I tried and tried and couldn't get a nice solution. I have the hope that there is something easy out there which I just don't see...Does anyone have a clue? Or is my data set up in such a way that it is actually hard (maybe because I have monthly observations, but want to join only quarterly or yearly changing values).

446

asked Jun 14 '12 09:06

Christoph_J

1 Answers

In use cases like this, the mantra "aggregate first, then join with that" often helps. So, starting with your DT, and using v1.8.1 :

> agg = DT[,last(value),by=list(ID,F0YR)]
> agg
   ID F0YR V1
1:  A    1  1
2:  A    4  2
3:  A    7  3
4:  B    1  4
5:  B    4  5
6:  B    7  7

I called it agg because I couldn't think of a better name. In this case you wanted last which isn't really an aggregate as such, but you know what I mean.

Then update DT by reference by group. Here we're grouping by i.

setkey(DT,ID,F1YR)
DT[agg,newcol:=V1]
    ID MONTH F0YR F1YR value newcol
 1:  A   100    1    4     1      2
 2:  A   101    1    4     1      2
 3:  A   102    1    4     1      2
 4:  A   103    4    7     2      3
 5:  A   104    4    7     2      3
 6:  A   105    4    7     2      3
 7:  A   106    7    9     3     NA
 8:  A   107    7    9     3     NA
 9:  A   108    7    9     3     NA
10:  B   100    1    4     4      5
11:  B   101    1    4     4      5
12:  B   102    1    4     4      5
13:  B   103    4    7     5      7
14:  B   104    4    7     5      7
15:  B   105    4    7     5      7
16:  B   106    7    9     6     NA
17:  B   107    7    9     6     NA
18:  B   108    7    9     7     NA

Is that right? Not sure I fully followed. Those ops should be very fast, without any copies, and should scale to large data. At least, that's the intention.

155

answered Oct 20 '22 13:10

Matt Dowle

Related questions
                            
                                Start multiple h2o cluster from within R
                            
                                Tail recursion in R
                            
                                Is there a way to deal with nested data with sparklyr?
                            
                                Programmatically scraping a response header within R
                            
                                How to identify the function used by geom_smooth()
                            
                                sum non NA elements only, but if all NA then return NA
                            
                                Finding specific strings in an array using R
                            
                                R Shiny authentication using AWS Cognito
                            
                                fuzzy matching in R
                            
                                Stored Input values in shiny widgets?
                            
                                Understanding Keras prediction output of a rnn model in R
                            
                                Prevent pagebreak in kableExtra landscape table
                            
                                How to save a leaflet map with drawn shapes/points on it in Shiny?
                            
                                Write a loop to select all combination of variable values generating positive equation values in R
                            
                                R: how to sample without replacement AND without consecutive same values
                            
                                modify lm or loess function to use it within ggplot2's geom_smooth
                            
                                which list element is being processed when using snowfall::sfLapply?
                            
                                Handling field types in database interaction with R
                            
                                plotting and coloring data on irregular grid
                            
                                ggplot vertical line with date axis

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to best join one column of a data.table with another column of the same data.table?

Tags:

r

data.table