I would like to turn the first table into the second by selecting the last observation of a group for a and b, the first observation for c, summing the observations within the group for d and e, and, for f, checking whether a valid date exists and using that date.
Table 1:
ID a b c d e f
1 10 100 1000 10000 100000 ?
1 10 100 1001 10010 100100 5/07/1977
1 11 111 1002 10020 100200 5/07/1977
2 22 222 2000 20000 200000 6/02/1980
3 33 333 3000 30000 300000 20/12/1978
3 33 333 3001 30010 300100 ?
4 40 400 4000 40000 400000 ?
4 40 400 4001 40010 400100 ?
4 40 400 4002 40020 400200 7/06/1944
4 44 444 4003 40030 400300 ?
4 44 444 4004 40040 400400 ?
4 44 444 4005 40050 400500 ?
5 55 555 5000 50000 500000 31/05/1976
5 55 555 5001 50010 500100 31/05/1976
Table 2:
ID a b c d e f
1 11 111 1000 30030 300300 5/07/1977
2 22 222 2000 20000 200000 6/02/1980
3 33 333 3000 60010 600100 20/12/1978
4 44 444 4000 240150 2401500 7/06/1944
5 55 555 5000 100010 1000100 31/05/1976
I have looked through Stack Overflow questions and have only seen elements of this. I can do a through e in the following steps.
library(data.table)
setwd('D:/Work/BRB/StackOverflow')
DT = data.table(fread('datatable.csv', header=TRUE))
AB = DT[ , .SD[.N], ID ]
AB = AB[ , c('a', 'b') ]
C = DT[ , .SD[1], ID ]
C = C[ , 'c' ]
DE = DT[ , .(d = sum(d), e = sum(e)) , by = ID ]
Final = cbind(AB, C, DE)
Final
My question is: can I do the operations on variables a, b, c, d and e in one transformation, without having to split it into three steps? Also, I have no idea how to do f. Any suggestions?
Finally, I am new to R. Anything else I can improve about my code?
There are several things you can improve:
- fread will return a data.table, so there is no need to wrap it in data.table(). You can check with class(DT).
- Use the na.strings parameter when reading in the data. See below for an example.
Summarise with:
DT[, .(a = a[.N],
b = b[.N],
c = c[1],
d = sum(d),
e = sum(e),
f = unique(na.omit(f)))
, by = ID]
you will then get:
   ID  a   b    c      d       e          f
1:  1 11 111 1000  30030  300300  5/07/1977
2:  2 22 222 2000  20000  200000  6/02/1980
3:  3 33 333 3000  60010  600100 20/12/1978
4:  4 44 444 4000 240150 2401500  7/06/1944
5:  5 55 555 5000 100010 1000100 31/05/1976
Some explanations & other notes:
[1]
will give you the first value of a group. You could also use the first
-function which is optimized in data.table, and thus faster.[.N]
will give you the last value of a group. You could also use the last
-function which is optimized in data.table, and thus faster.c
as a variable name). See also ?c
for an explanation of what the c
-function does.f
-variable, I used unique
in combination with na.omit
. If there is more than one unique date by ID
, you could also use for example na.omit(f)[1]
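For illustration, a minimal sketch of the difference on a made-up character vector (x and the second date are purely hypothetical, not taken from the data above):
x <- c(NA, "5/07/1977", "8/07/1977")   # a hypothetical group with two distinct non-missing dates
unique(na.omit(x))                     # "5/07/1977" "8/07/1977" -> length 2, would produce two rows for that group
na.omit(x)[1]                          # "5/07/1977"             -> always a single value (the first non-NA)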
If speed is an issue, you could optimize the above to (thanks to @Frank):
DT[order(f)
, .(a = last(a),
b = last(b),
c = first(c),
d = sum(d),
e = sum(e),
f = first(f))
, by = ID]
Ordering by f will put NA values last. As a result, the internal GForce optimization is now used for all calculations.
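If you want to check that GForce really kicks in, you can run the grouped query with verbose = TRUE (a quick sanity check, using the DT from the 'Used data' block below; the exact wording of the verbose output may differ between data.table versions):
DT[order(f),
   .(a = last(a), b = last(b), c = first(c),
     d = sum(d), e = sum(e), f = first(f)),
   by = ID, verbose = TRUE]
# the verbose output reports whether GForce optimization of j was applied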
Used data:
DT <- fread("ID a b c d e f
1 10 100 1000 10000 100000 ?
1 10 100 1001 10010 100100 5/07/1977
1 11 111 1002 10020 100200 5/07/1977
2 22 222 2000 20000 200000 6/02/1980
3 33 333 3000 30000 300000 20/12/1978
3 33 333 3001 30010 300100 ?
4 40 400 4000 40000 400000 ?
4 40 400 4001 40010 400100 ?
4 40 400 4002 40020 400200 7/06/1944
4 44 444 4003 40030 400300 ?
4 44 444 4004 40040 400400 ?
4 44 444 4005 40050 400500 ?
5 55 555 5000 50000 500000 31/05/1976
5 55 555 5001 50010 500100 31/05/1976", na.strings='?')
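With na.strings = '?', the '?' entries in f come in as NA, which is what makes na.omit(f) work above. A quick check, plus (as an extra step not in the original answers) how you could parse f into a real date with data.table's as.IDate:
DT[is.na(f), .N]                            # 7 -- the rows whose f was '?' in the raw data
as.IDate("5/07/1977", format = "%d/%m/%Y")  # "1977-07-05" -- day/month/year parsed to an IDate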
We can use the tidyverse. After grouping by 'ID', we summarise the columns based on the last or first observation, the sum, or the first valid date.
library(dplyr)
DT %>%
group_by(ID) %>%
summarise(a = last(a),
b = last(b),
c = first(c),
d = sum(d),
e = sum(e),
f = f[f!="?"][1])
# A tibble: 5 × 7
# ID a b c d e f
# <int> <int> <int> <int> <int> <int> <chr>
#1 1 11 111 1000 30030 300300 5/07/1977
#2 2 22 222 2000 20000 200000 6/02/1980
#3 3 33 333 3000 60010 600100 20/12/1978
#4 4 44 444 4000 240150 2401500 7/06/1944
#5 5 55 555 5000 100010 1000100 31/05/1976
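Note that f = f[f != "?"][1] assumes the data was read without na.strings, so f still contains the literal "?" strings. If you read the file with na.strings = '?' as in the data.table answer above, a sketch of the same summarise with only the f line adapted would be:
library(dplyr)
DT %>%
  group_by(ID) %>%
  summarise(a = last(a),
            b = last(b),
            c = first(c),
            d = sum(d),
            e = sum(e),
            f = first(na.omit(f)))   # first non-NA date per ID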