I'm learning R and I'm not sure whether to standardise on dplyr or data.table. dplyr has really nice syntax, but as far as I understand it copies the data frame on each operation, which is (or could be) a drawback.
One thing I can't figure out is the data.table alternative to mutate.
if I have
df %>% group_by(foo) %>% mutate(
bar = cumsum(baz),
q = bar * 3.14)
In data.table I could do something like
df[, c("bar") := list(cumsum(baz)), by = foo]
df$q <- df$bar * 3.14
Is there a better way of doing this in data.table?
You can do just this:
# some test data:
df <- data.table(baz = 1:10, foo = c(rep(1, 5), rep(2, 5)))
df[, bar := cumsum(baz), by = foo]
df[, q := bar*3.14]
Even though it takes two lines, it is very readable and easy to write.
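If you would rather keep it to a single statement, data.table also supports chaining: each `[...]` returns the data.table, so the second update can reference the `bar` column created by the first. A minimal sketch using the same test data as above:

```r
library(data.table)

# same test data as above
df <- data.table(baz = 1:10, foo = c(rep(1, 5), rep(2, 5)))

# chain the two updates: the first [...] adds bar by reference and
# returns df, so the second [...] can immediately use the new column
df[, bar := cumsum(baz), by = foo][, q := bar * 3.14]
```

Both forms modify `df` in place (no copy is made), which is the main efficiency argument for `:=` over mutate.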