I have a large data.table with more than 1 billion observations, and I need to perform some string operations, which are slow.
My code is as simple as this:
DT[, var := some_function(var2)]
If I'm not mistaken, data.table uses multithreading when it is called with by, and I'm trying to parallelize this operation by exploiting that. To do so, I can make an interim grouper variable, such as
DT[, grouper := .I %/% 100]
and do
DT[, var := some_function(var2), by = grouper]
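As a concrete sketch of the chunking idea above (toy data, and a hypothetical toupper-based stand-in for some_function, since the real string operation isn't shown):

```r
library(data.table)

# Hypothetical stand-in for the real (slow) string operation
some_function <- function(x) toupper(x)

DT <- data.table(var2 = c("a", "b", "c", "d", "e", "f"))

# Interim grouper: one group per chunk of rows (chunk size 2 here,
# 100 in the question; pick it to balance chunk count vs. overhead)
DT[, grouper := .I %/% 2]

# Apply the function chunk by chunk via by
DT[, var := some_function(var2), by = grouper]
```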
I tried some benchmarking with a small sample of the data, but surprisingly I did not see a performance improvement. So my questions are:

1. Does data.table use multithreading when it's used with by?
2. If not, is there a way to get data.table to use multithreading here?

FYI, I see that multithreading is enabled with half of my cores when I load data.table, so I guess there's no OpenMP issue here.
I got answers from the data.table developers on the data.table GitHub. Here's a summary:

- Finding the groups of the by variable is itself always parallelized, but more importantly,
- if the function in j is generic (a user-defined function), there is no parallelization;
- operations in j are parallelized only if the function is GForce-optimized, i.e. the expression in j contains only the functions min, max, mean, median, var, sd, sum, prod, first, last, head, and tail.

So it is advised to do the parallel operation manually if the function in j is generic, but this may not always guarantee a speed gain. Reference
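You can check whether a given j expression gets the GForce treatment by running the query with verbose = TRUE (a minimal sketch; the exact wording of the verbose output varies between data.table versions):

```r
library(data.table)

DT <- data.table(g = rep(1:2, each = 3), x = 1:6)

# GForce-eligible: j contains only an optimized function (sum),
# so the grouped computation can be parallelized internally.
# The verbose output reports the GForce optimization applied to j.
res1 <- DT[, .(s = sum(x)), by = g, verbose = TRUE]

# Not GForce-eligible: a user-defined function in j is evaluated
# once per group in plain R, without that optimization.
udf <- function(v) sum(v)
res2 <- DT[, .(s = udf(x)), by = g, verbose = TRUE]

# Both give the same result; only the execution path differs.
```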
==Solution==
In my case, I encountered a "vector memory exhausted" error when I plainly used
DT[, var := some_function(var2)]
even though my server had 1 TB of RAM, while the data took 200 GB of memory.
I used split(DT, by = 'grouper') to split my data.table into chunks, and used doFuture with foreach and %dopar% to do the job. It was pretty fast.
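A minimal sketch of that split/foreach pattern (worker count and chunk size are placeholders, and some_function again stands in for the real string operation):

```r
library(data.table)
library(doFuture)   # attaches foreach and future as well

registerDoFuture()
plan(multisession, workers = 2)  # e.g. half the available cores

# Hypothetical stand-in for the real string operation
some_function <- function(x) toupper(x)

DT <- data.table(var2 = letters[1:6])
DT[, grouper := .I %/% 2]

# Split into a list of chunks, one per grouper value
chunks <- split(DT, by = "grouper")

# Process the chunks in parallel, then reassemble
result <- foreach(chunk = chunks, .combine = rbind) %dopar% {
  chunk[, var := some_function(var2)]
  chunk
}

plan(sequential)  # release the workers
```

Each worker only ever holds one chunk, which is also what avoids blowing up memory the way the single whole-table assignment did.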