Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a visual explanation of why data.table operations are faster than tidyverse operations when you need to group by a variable?

Tags:

r

data.table

I understand from excellent resources here, here and here that data.table utilises automatic indexing (to create a key i.e. supercharged row names) and binary search based subset in contrast to tidyverse, which relies on vector scanning.

I understand that vector scanning requires scanning each individual row and the creation of nrow(dataset) length logical vectors, and that doing this repeatedly is not as efficient.

I'm wondering if someone can help me frame exactly how these two methods means that data.table operations run a lot faster compared to tidyverse when you need to group by a variable. I.e. is it because data.table automatically indexes the group_by column and breaks it into grouped subsets and runs operations on each subset, whilst a vector scanning approach would require the generation of n = unique groups of multiple logical vectors, and then run operations on each individual logical vector, before collating results?

visual diagram

Also, according to the data.table vignette,

We can set keys on multiple columns and the column can be of different types...

Since the rows are reordered, a data.table can have at most one key because it can not be sorted in more than one way.

What does it mean that we can set keys on multiple columns and yet a data.table can have at most one key? I.e. is it that during any moment when running an operation, there is only one reference key, but which column the reference key is set as can change as we progress to another component of the overall operation?

Thank you in advance!

like image 261
EMD Avatar asked Apr 20 '20 12:04

EMD


1 Answers

There is no.

There are different ways to finding groups, and then to compute expression by groups. Each single thing can be differently implemented. They are not related to keys or index. Also data.table is not automatically creating key/index during group by (as of now).

data.table has very fast, carefully implemented, order function, it is being used to find groups. It was contributed to base R later on. There is an idea to use it in dplyr to speed up grouping: https://github.com/tidyverse/dplyr/issues/4406
Yet data.table order function got improved since then and now scales even better.

Aside from finding groups, there is a part about computing an expression. If we evaluate "user defined function" it will always be much slower. Many common functions are internally optimized, so they don't switch between R and C for every group. Here, data.table has also very carefully implemented "GForce" functions. Not sure but in dplyr they are called "hybrid evaluation".

It is always important to test on your particular data use case. If you have just 2 unique groups in data, then fast grouping algorithms will not shine much.

Also there is a community repository which meant to describe data.table algorithms https://github.com/asantucci/algo_data.table but it is not very active. I just recently posted there a comment about "groupby optimization", will paste it here as well. Answer was provided by data.table author Matt Dowle.

Q: does GForce allocate mem for biggest group, then copy there values of a group, to aggregate, so it can benefit from being contiguous in memory and will be more cache efficient? if so, do we check if groups aren't sorted already? so we can avoid doing allocation and copy?

A: gforce (gsum) assigns to many group results at once; it doesn't gather the groups together. You're describing non-gforce (dogroup.c) which copies to the largest group. See the branch in dogroups.c which knows whether groups are already grouped: it swithes to a memcpy. The memcpy is very fast (contiguous, pre-fetch) so it's pretty good already. We must copy because R's DATAPTR is not a pointer we can repoint, it's an offset from SEXP.

like image 189
jangorecki Avatar answered Oct 04 '22 22:10

jangorecki