
Break data.table chain into two lines of code for readability

I'm working on an R Markdown document and was told to keep every line strictly within 100 columns (the margin column). The document's code chunks use many different packages, among them data.table.

In order to comply with the limit I can split chains (and even long commands) like:

p <- ggplot(foo,aes(bar,foo2))+
       geom_line()+
       stat_smooth()
bar <- sum(long_variable_name_here,
         na.rm=TRUE)
foo <- bar %>% 
         group_by(var) %>%
         summarize(var2=sum(foo2))

but I can't split a data.table chain, as it produces an error. How can I achieve something like this?

bar <- foo[,.(long_name_here=sum(foo2)),by=var]
           [order(-long_name_here)]

The last line, of course, causes an error. Thanks!

PavoDive asked Nov 17 '15 16:11


3 Answers

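You have to put the line break inside the square brackets, that is, somewhere between a [ and its closing ]. As long as a bracket is still open, R treats the next line as a continuation. An example of how to divide your data.table code over several lines: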

bar <- foo[, .(long_name_here = sum(foo2)), by = var
           ][order(-long_name_here)]

You can also break the line before or after each comma. An example with the break before the comma (my preference):

bar <- foo[, .(long_name_here = sum(foo2))
           , by = var
           ][order(-long_name_here)
             , long_name_2 := long_name_here * 10]

See this answer for an extended example
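As a minimal runnable sketch of the two layouts above: the toy table below is hypothetical (only the column names come from the question), and the check at the end simply confirms that both layouts produce the same result.

```r
library(data.table)

# Hypothetical toy data; only the column names come from the question.
foo <- data.table(var  = c("a", "a", "b", "c", "c"),
                  foo2 = c(1, 2, 10, 3, 4))

# Layout 1: break just before the closing ] of the first call.
bar1 <- foo[, .(long_name_here = sum(foo2)), by = var
            ][order(-long_name_here)]

# Layout 2: additionally break before a comma inside the brackets.
bar2 <- foo[, .(long_name_here = sum(foo2))
            , by = var
            ][order(-long_name_here)]

# Both layouts parse to the same chain, so the results match.
same_result <- isTRUE(all.equal(bar1, bar2))
print(bar1)
```

In both layouts every line break falls while a [ is still open, which is what keeps the parser reading.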

Jaap answered Oct 13 '22 07:10


Chaining data.tables with magrittr

With magrittr, I chain data.tables by applying [ to the . placeholder object:

library(magrittr)
library(data.table)

bar <- foo %>%
        .[etcetera] %>%
        .[etcetera] %>%
        .[etcetera]

Working example:

out <- data.table(expand.grid(x = 1:10,y = 1:10))
out %>% 
  .[,z := x*y] %>% 
  .[,w := x*z] %>% 
  .[,v := w*z]
print(out)

Additional examples

Edit: this is more than syntactic sugar. It lets you refer to the table from the previous step as ., which means that you can do a self join, and you can use %T>% for some logging in between steps (using futile.logger or the like):

out %>%
 .[etcetera] %>%
 .[etcetera] %T>% 
 .[loggingstep] %>%
 .[etcetera] %>%
 .[., on = SOMEVARS, allow.cartesian = TRUE]
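The skeleton above could be filled in roughly as follows. This is only a sketch: the table, the column names, and the use of message() in place of a real logging package are assumptions made for illustration.

```r
library(magrittr)
library(data.table)

# Hypothetical small table, built the same way as in the working example.
out <- data.table(expand.grid(x = 1:3, y = 1:3))

res <- out %>%
  .[, z := x * y] %T>%
  # Tee step: runs only for its side effect; the table itself is passed on.
  { message("rows after adding z: ", nrow(.)) } %>%
  .[, .(total = sum(z)), by = x] %>%
  # Self join: the inner . is the aggregated table from the previous step.
  .[., on = "x"]

print(res)
```

Note that the := step still modifies out by reference, exactly as it would without the pipe.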

EDIT:

This is much later, and I still use this regularly. But I have the following caveat:

magrittr adds overhead

I really like doing this at the top level of a script. It has a very clear and readable flow, and there are a number of neat tricks you can do with it.

But I have had to remove it when optimizing code in which the chain sits inside a function that gets called many times.

In that case you're better off chaining data.tables the old-fashioned way.
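One rough way to check whether the overhead matters in a given case is to time both forms on the same input. The sketch below assumes a hypothetical small table and repetition count; the timings depend on the machine and package versions, so no particular numbers are claimed.

```r
library(magrittr)
library(data.table)

# Hypothetical small input; per-call overhead matters most on small tables.
dt <- data.table(g = rep(1:5, each = 4), v = 1:20)

# The same two-step chain, written natively and with magrittr.
native_chain <- function(d) d[, .(s = sum(v)), by = g][order(-s)]
piped_chain  <- function(d) d %>% .[, .(s = sum(v)), by = g] %>% .[order(-s)]

# Sanity check: both forms return the same table.
same <- isTRUE(all.equal(native_chain(dt), piped_chain(dt)))

n <- 200  # hypothetical number of repeated calls
t_native <- system.time(for (i in seq_len(n)) native_chain(dt))[["elapsed"]]
t_piped  <- system.time(for (i in seq_len(n)) piped_chain(dt))[["elapsed"]]
print(c(native = t_native, piped = t_piped))
```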

Shape answered Oct 13 '22 06:10


For many years, the way that automatic indentation in RStudio mis-aligns data.table pipes has been a source of frustration to me. I only recently realized that there is a neat way to get around this, simply by enclosing the piped operations in parentheses.

Here's a simple example:

# b is recycled explicitly, since 5 values do not divide evenly into 26 rows
x <- data.table(a = letters, b = rep_len(LETTERS[1:5], 26), c = rnorm(26))
y <- (
  x
  [, c := round(c, 2)]
  [sample(26)]
  [, d := paste(a,b)]
  [, .(d, foo = mean(c)), by = b]
  )

Why does this work? Because the un-closed parenthesis signals to the R interpreter that the current line is still not complete, and therefore the whole pipe is treated in the same way as a continuous line of code.
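The parser behaviour described above can be checked without data.table at all. The sketch below feeds two hypothetical code strings to parse() and records whether each one is syntactically valid.

```r
# Returns TRUE if the code string parses, FALSE if the parser raises an error.
parses_ok <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

# Without parentheses the first line is already a complete expression,
# so the second line starts with a [ that has nothing to attach to.
ok_unwrapped <- parses_ok("x\n[1]")

# Inside an open parenthesis the line breaks are ignored, so this is x[1].
ok_wrapped <- parses_ok("(\n  x\n  [1]\n)")

print(c(unwrapped = ok_unwrapped, wrapped = ok_wrapped))
# unwrapped is FALSE, wrapped is TRUE
```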

dww answered Oct 13 '22 05:10