Is my way of duplicating rows in data.table efficient?

Tags:

r

data.table

I have monthly data in one data.table and annual data in another data.table and now I want to match the annual data to the respective observation in the monthly data.

My approach is to duplicate the annual data for every month and then join the monthly and annual data. My question is about the duplication of rows: I know how to do it, but I'm not sure it is the best way, so some opinions would be great.

Here is an example data.table DT for my annual data and how I currently duplicate it:

library(data.table)
DT <- data.table(ID = paste(rep(c("a", "b"), each=3), c(1:3, 1:3), sep="_"),
                    values = 10:15,
                    startMonth = seq(from=1, by=2, length=6),
                    endMonth = seq(from=3, by=3, length=6))
DT
      ID values startMonth endMonth
[1,] a_1     10          1        3
[2,] a_2     11          3        6
[3,] a_3     12          5        9
[4,] b_1     13          7       12
[5,] b_2     14          9       15
[6,] b_3     15         11       18
#1. Alternative
DT1 <- DT[, list(MONTH=startMonth:endMonth), by="ID"]
setkey(DT,  ID)
setkey(DT1, ID)
DT1[DT]
ID MONTH values startMonth endMonth
a_1     1     10          1        3
a_1     2     10          1        3
a_1     3     10          1        3
a_2     3     11          3        6
[...]

The last join is exactly what I want. However, DT[, list(MONTH=startMonth:endMonth), by="ID"] already does everything I want except adding the other columns of DT, so I was wondering if I could get rid of the last three lines of my code, i.e. the setkey and join operations. It turns out you can; just do the following:

#2. Alternative: More intuitive and just one line of code
DT[, list(MONTH=startMonth:endMonth, values, startMonth, endMonth), by="ID"]
 ID MONTH values startMonth endMonth
a_1    1     10          1        3
a_1    2     10          1        3
a_1    3     10          1        3
a_2    3     11          3        6
...

This, however, only works because I hardcoded the column names into the list expression. In my real data, I do not know the names of all columns in advance, so I was wondering if I could just tell data.table to return the column MONTH that I compute as shown above and all the other columns of DT. .SD seemed to be able to do the trick, but:

DT[, list(MONTH=startMonth:endMonth, .SD), by="ID"]
Error in `[.data.table`(DT, , list(YEAR = startMonth:endMonth, .SD), by = "ID") : 
  maxn (4) is not exact multiple of this j column's length (3)

So to summarize, I know how it can be done, but I was wondering if this is the best way to do it, because I'm still struggling a little with the syntax of data.table and often read in posts and on the wiki that there are good and bad ways of doing things. Also, I don't quite understand why I get an error when using .SD. I thought it was just an easy way to tell data.table that you want all columns. What am I missing?

asked Nov 04 '11 13:11

Christoph_J

2 Answers

Looking at this, I realized that the answer only works because ID is a unique key (without duplicates). Here is another answer that handles duplicates. By the way, some NAs seem to creep in. Could this be a bug? I'm using v1.8.7 (commit 796).

library(data.table)
DT <- data.table(x=c(1,1,1,1,2,2,3),y=c(1,1,2,3,1,1,2))

DT[,rep:=1L][c(2,7),rep:=c(2L,3L)]   # duplicate row 2 and triple row 7
DT[,num:=1:.N]                       # to group each row by itself

DT
   x y rep num
1: 1 1   1   1
2: 1 1   2   2
3: 1 2   1   3
4: 1 3   1   4
5: 2 1   1   5
6: 2 1   1   6
7: 3 2   3   7

DT[,cbind(.SD,dup=1:rep),by="num"]
    num x y rep dup
 1:   1 1 1   1   1
 2:   2 1 1   1  NA      # why these NA?
 3:   2 1 1   2  NA
 4:   3 1 2   1   1
 5:   4 1 3   1   1
 6:   5 2 1   1   1
 7:   6 2 1   1   1
 8:   7 3 2   3   1
 9:   7 3 2   3   2
10:   7 3 2   3   3

Just for completeness, a faster way is to rep the row numbers and then take the subset in one step (no grouping and no use of cbind or .SD):

DT[rep(num,rep)]
    x y rep num
 1: 1 1   1   1
 2: 1 1   2   2
 3: 1 1   2   2
 4: 1 2   1   3
 5: 1 3   1   4
 6: 2 1   1   5
 7: 2 1   1   6
 8: 3 2   3   7
 9: 3 2   3   7
10: 3 2   3   7

Note that in this example data the column rep happens to share its name with the base function rep().
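
Applied to the DT from the question, the same idea might look like the sketch below. It assumes that ID is unique per row (as in the example data), so that grouping by ID after the subset recovers each original row; the names n_months and expanded are made up for illustration.

```r
library(data.table)

DT <- data.table(ID = paste(rep(c("a", "b"), each = 3), c(1:3, 1:3), sep = "_"),
                 values = 10:15,
                 startMonth = seq(from = 1, by = 2, length.out = 6),
                 endMonth = seq(from = 3, by = 3, length.out = 6))

# Number of months each row covers (inclusive range); hypothetical helper name
n_months <- DT$endMonth - DT$startMonth + 1

# Repeat each row index n_months times and subset once, without grouping
expanded <- DT[rep(seq_len(nrow(DT)), n_months)]

# Assumption: ID is unique in DT, so each ID group is one original row repeated.
# Count up from startMonth within each group to get the MONTH column.
expanded[, MONTH := startMonth + seq_len(.N) - 1, by = ID]
```

All original columns are carried along automatically because the rows are selected by index, so no column names need to be hardcoded.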

answered Oct 26 '22 07:10

statquant

Great question. What you tried was very reasonable. Assuming you're using v1.7.1 it's now easier to make list columns. In this case it's trying to make one list column out of .SD (3 items) alongside the MONTH column of the 2nd group (4 items). I'll raise it as a bug [EDIT: now fixed in v1.7.5], thanks.

In the meantime, try:

DT[, cbind(MONTH=startMonth:endMonth, .SD), by="ID"]
 ID MONTH values startMonth endMonth
a_1     1     10          1        3
a_1     2     10          1        3
a_1     3     10          1        3
a_2     3     11          3        6
...
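
As a possible alternative to cbind, the computed column can be combined with .SD using c(), which yields a plain list of columns. This is only a sketch and assumes a reasonably recent data.table version, where length-1 items returned from j are recycled to the length of the longest item within each group.

```r
library(data.table)

DT <- data.table(ID = paste(rep(c("a", "b"), each = 3), c(1:3, 1:3), sep = "_"),
                 values = 10:15,
                 startMonth = seq(from = 1, by = 2, length.out = 6),
                 endMonth = seq(from = 3, by = 3, length.out = 6))

# c() concatenates the one-element list with the columns of .SD.
# Within each ID group .SD has a single row, so its columns have length 1
# and are recycled to match the length of MONTH.
res <- DT[, c(list(MONTH = startMonth:endMonth), .SD), by = ID]
```

Because .SD holds every column except the grouping column, the other column names do not need to be known in advance.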

Also, just to check: have you seen roll=TRUE? Typically you'd have just one startMonth column (irregular with gaps) and then just roll join to it. Your example data has overlapping month ranges though, so that complicates it.
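
For illustration, a rolling join on non-overlapping ranges might look like the sketch below. The tables annual and monthly are hypothetical and not taken from the question; the sketch assumes each value applies from its startMonth until the next startMonth, and it uses the on= argument, which is only available in later data.table versions than the one discussed above.

```r
library(data.table)

# Hypothetical lookup table: each value is valid from startMonth onwards,
# until the next startMonth takes over (no overlapping ranges).
annual <- data.table(startMonth = c(1L, 4L, 10L), values = c(10L, 11L, 12L))

# Hypothetical monthly table to be matched against the lookup table
monthly <- data.table(MONTH = 1:12)

# For each MONTH, take the row of annual with the latest startMonth <= MONTH.
# In the result, the startMonth column holds the MONTH values from monthly.
res <- annual[monthly, on = .(startMonth = MONTH), roll = TRUE]
```

With this layout no rows need to be duplicated up front; the join itself carries each value forward until the next start month.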

answered Oct 26 '22 07:10

Matt Dowle
