
Group a data.table using a column which is a list

Tags:

r

data.table

I have a really big problem, and looping through the data.table to do what I want is too slow, so I am trying to avoid looping. Let's assume I have a data.table as follows:

library(data.table)
a <- data.table(i = c(1,2,3), j = c(2,2,6), k = list(c("a","b"),c("a","c"),c("b")))

> a
  i j   k
1: 1 2 a,b
2: 2 2 a,c
3: 3 6   b

And I want to group based on the values in k. So something like this:

a[, sum(j), by = k]

Right now I am getting the following error:

 Error in `[.data.table`(a, , sum(i), by = k) : 
 The items in the 'by' or 'keyby' list are length (2,2,1). Each must be same length as rows in x or number of rows returned by i (3).

What I am looking for is to first take all the rows that have "a" in column k and calculate sum(j), then all the rows that have "b", and so on. So the desired answer would be:

k V1 
a 4
b 8
c 2

Any hints on how to do this efficiently? I can't melt column k by repeating the rows, since the resulting data.table would be too big in my case.

asked Jul 31 '16 15:07

newbie

3 Answers

I think this might work:

a[, .(k = unlist(k)), by=.(i,j)][,sum(j),by=k]

   k V1
1: a  4
2: b  8
3: c  2
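
To make the two chained steps easier to follow, here is a minimal sketch that splits them apart on the example data from the question. The names `long` and `res` are made up for illustration, and I've named the result column `V1` explicitly so it matches the output above.

```r
library(data.table)

# Example data from the question
a <- data.table(i = c(1, 2, 3), j = c(2, 2, 6),
                k = list(c("a", "b"), c("a", "c"), c("b")))

# Step 1: flatten the list column within each (i, j) group,
# giving one row per element of k
long <- a[, .(k = unlist(k)), by = .(i, j)]

# Step 2: k is now an ordinary character column, so a plain grouped sum works
res <- long[, .(V1 = sum(j)), by = k]
print(res)
```

Note that `long` has one row per element of k, so this does build the expanded table the question wanted to avoid, just with fewer columns than a full melt.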


answered Nov 02 '22 17:11

Mike H.

If we are using tidyr, a compact option would be

library(tidyr)
unnest(a, k)[, sum(j) ,k]
#   k V1
#1: a  4
#2: b  8
#3: c  2

Or using the dplyr/tidyr pipes

unnest(a, k) %>%
       group_by(k) %>%
       summarise(V1 = sum(j))
#     k    V1
#   <chr> <dbl>
#1     a     4
#2     b     8
#3     c     2

answered Nov 02 '22 16:11

akrun

Since by-group operations can be slow, I'd consider...

dat = a[rep(1:.N, lengths(k)), c(.SD, .(k = unlist(a$k))), .SDcols=setdiff(names(a), "k")]

   i j k
1: 1 2 a
2: 1 2 b
3: 2 2 a
4: 2 2 c
5: 3 6 b

We're repeating the rows of columns i and j so that each row lines up with one element of the unlisted k. It's probably better to keep the data in this long format rather than in a list column. From there, as in @MikeyMike's answer, we can dat[, sum(j), by=k].

In data.table 1.9.7+, we can similarly do

dat = a[, c(.SD[rep(.I, lengths(k))], .(k = unlist(k))), .SDcols=i:j]
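
If memory is the main worry, the same row-repetition idea can be applied to just the column being summed rather than to every column. This is only a sketch of that variation, not something from the answers above; `n`, `slim` and `res` are made-up names, and it assumes j is the only column needed for the aggregate.

```r
library(data.table)

# Example data from the question
a <- data.table(i = c(1, 2, 3), j = c(2, 2, 6),
                k = list(c("a", "b"), c("a", "c"), c("b")))

# Number of elements in each list entry of k: 2, 2, 1
n <- lengths(a$k)

# Build a two-column table: the unlisted k, and j repeated to line up with it.
# Column i and any other columns are never copied.
slim <- data.table(k = unlist(a$k), j = rep(a$j, n))

# Grouped sum on the atomic column k
res <- slim[, .(V1 = sum(j)), by = k]
print(res)
```

The intermediate table still has one row per element of k, but it carries only two columns, so it is as small as an unlist-based approach can get.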

answered Nov 02 '22 18:11

Frank
