I'm working on an R Markdown document and was told to strictly limit lines to a maximum of 100 columns (the margin column). In the document's code chunks I use many different packages, among them data.table.
In order to comply with the limit I can split chains (and even long commands) like:
p <- ggplot(foo, aes(bar, foo2)) +
  geom_line() +
  stat_smooth()
bar <- sum(long_variable_name_here,
           na.rm = TRUE)
foo <- bar %>%
  group_by(var) %>%
  summarize(var2 = sum(foo2))
but I can't split a data.table chain this way, as it produces an error. How can I achieve something like this?
bar <- foo[,.(long_name_here=sum(foo2)),by=var]
[order(-long_name_here)]
Last line, of course, causes an error. Thanks!
You have to put the line break between the [ and the ] of each call, so that the closing ] starts the next line. An example of how to split your data.table code over several lines:
bar <- foo[, .(long_name_here = sum(foo2)), by = var
           ][order(-long_name_here)]
You can also break the line before or after each comma. An example with the break before the comma (my preference):
bar <- foo[, .(long_name_here = sum(foo2))
           , by = var
           ][order(-long_name_here)
             , long_name_2 := long_name_here * 10]
See this answer for an extended example
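To try the bracket-splitting pattern on real data, here is a minimal sketch. The table foo and its values are made up for illustration (the question does not show its data), and the := step is chained separately to keep the result easy to check:

```r
library(data.table)

# Hypothetical sample data standing in for the question's `foo`
foo <- data.table(var  = c("a", "a", "b", "c", "c"),
                  foo2 = c(1, 2, 10, 4, 5))

# Each line ends inside a still-open bracket, so the parser keeps reading;
# the closing "]" sits at the start of the following line.
bar <- foo[, .(long_name_here = sum(foo2)), by = var
           ][order(-long_name_here)
             ][, long_name_2 := long_name_here * 10]

print(bar)
```

Because every line break falls inside an open bracket, R never sees a complete expression until the final ].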
Chaining data.tables with magrittr
One method I use relies on magrittr, applying [ to the . placeholder object:
library(magrittr)
library(data.table)
bar <- foo %>%
  .[etcetera] %>%
  .[etcetera] %>%
  .[etcetera]
Working example:
out <- data.table(expand.grid(x = 1:10, y = 1:10))
out %>%
  .[, z := x * y] %>%
  .[, w := x * z] %>%
  .[, v := w * z]
print(out)
Additional examples
Edit: it's also not just syntactic sugar, since it allows you to refer to the table from the previous step as ., which means that you can do a self join, or you can use %T>% for some logging in between steps (using futile.logger or the like):
out %>%
  .[etcetera] %>%
  .[etcetera] %T>%
  .[loggingstep] %>%
  .[etcetera] %>%
  .[., on = SOMEVARS, allow.cartesian = TRUE]
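As a concrete version of the logging step, here is a small sketch. It uses base R's message() in place of futile.logger, and the table and column names are invented for the example:

```r
library(magrittr)
library(data.table)

# Hypothetical example table
out <- data.table(x = 1:3, y = 4:6)

res <- out %>%
  .[, z := x * y] %T>%
  # %T>% runs the right-hand side for its side effect only and
  # passes the left-hand table on unchanged to the next step.
  { message("rows after adding z: ", nrow(.)) } %>%
  .[z > 4]

print(res)
```

Note that := modifies out by reference, so out itself also gains the z column.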
EDIT:
This is much later, and I still use this regularly. But I have the following caveat:
magrittr adds overhead
I really like doing this at the top level of a script. It has a very clear and readable flow, and there are a number of neat tricks you can do with it.
But I've had to remove this before when optimizing if it's part of a function that's being called lots of times.
You're better off chaining data.tables the old-fashioned way in that case.
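If you want to check whether the overhead matters for your own code, a rough sketch like the following compares the two styles. The functions native() and piped() are hypothetical names, and the timings will depend on your machine, package versions, and table size, so treat this as a starting point rather than a benchmark:

```r
library(magrittr)
library(data.table)

dt <- data.table(x = 1:10, y = 1:10)

# The same two-step chain, written with and without magrittr
native <- function(d) d[, .(x, z = x * y)][z > 20]
piped  <- function(d) d %>% .[, .(x, z = x * y)] %>% .[z > 20]

# Both styles should return the same table
stopifnot(isTRUE(all.equal(native(dt), piped(dt))))

# Call each version many times, as a hot inner function would be
n <- 1000L
print(system.time(for (i in seq_len(n)) native(dt)))
print(system.time(for (i in seq_len(n)) piped(dt)))
```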
For many years, the way that automatic indentation in RStudio mis-aligns data.table pipes has been a source of frustration to me. I only recently realized that there is a neat way to get around this, simply by enclosing the piped operations in parentheses.
Here's a simple example:
x <- data.table(a = letters, b = rep_len(LETTERS[1:5], 26), c = rnorm(26))
y <- (
  x
  [, c := round(c, 2)]
  [sample(26)]
  [, d := paste(a, b)]
  [, .(d, foo = mean(c)), by = b]
)
Why does this work? Because the unclosed parenthesis signals to the R parser that the current expression is not yet complete, so the whole chain is treated in the same way as one continuous line of code.
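You can see the difference without running any data.table code by asking R to parse the two layouts. This is a small base R sketch; the variable names and the chain inside the strings are made up, and nothing in them is evaluated:

```r
# The same chain as text, once wrapped in parentheses and once bare
wrapped <- "y <- (\n  x\n  [a > 1]\n  [order(-a)]\n)"
bare    <- "y <- x\n  [a > 1]\n  [order(-a)]"

# Inside parentheses the line breaks are ignored: one single expression
n_wrapped <- length(parse(text = wrapped))
print(n_wrapped)

# Without them, "y <- x" is already complete, so the next line
# starts with "[" and the parser reports a syntax error
bare_result <- tryCatch(parse(text = bare), error = function(e) e)
print(inherits(bare_result, "error"))
```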