
Should I use every core when doing parallel processing in R?

I'm using R to convert some shapefiles. R does this using just one core of my processor, and I want to speed it up using parallel processing. So I've parallelized the process like this.

Given files which is a list of files to convert:

library(doMC)
registerDoMC()

foreach(f=files) %dopar% {
  # Code to do the conversion
}

This works just fine and it uses 2 cores. According to the documentation for registerDoMC(), by default that function uses half the cores detected by the parallel package.

My question is: why should I use only half of the cores instead of all of them (in this case, 4)? By calling registerDoMC(detectCores()) I can use all the cores on my system. What, if any, are the downsides to doing this?
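
As a point of reference, here is a minimal sketch of setting the worker count explicitly (the cores argument is part of registerDoMC(); leaving one core free is just a common convention, not something required by the package):

library(doMC)
library(parallel)

# Use every core the system reports (logical cores included)
registerDoMC(cores = detectCores())

# Or leave one core free so the machine stays responsive for other work
registerDoMC(cores = max(1, detectCores() - 1))

getDoParWorkers()  # confirm how many workers foreach will actually use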

asked Aug 17 '13 by Lincoln Mullen

People also ask

How many cores should I use in parallel processing?

Based on the results, 7 cores would be the fastest solution. If you run it on your own machine and want to do other things alongside it, I would go with 4 cores, since the timings are comparable and the machine is not working at maximum capacity.

Does parallelism require multiple cores?

The answer is: it depends. On a system with more than one processor or CPU core (as is common with modern processors), multiple processes or threads can be executed in parallel. On a single core, though, it is not possible for processes or threads to truly execute at the same time.

Does R use all cores?

By default, R uses only one core, but this article tells you how to use multiple cores. If your simulation needs 20 hours to complete with one core, you may get your results within four hours thanks to parallelization!

Does R automatically use multiple cores?

Unfortunately, R is not natively able to use several cores at the same time! This is true for most other programs as well. If you use a computer with 8 cores (8 CPUs) and ask R to perform 10^9 additions, do not expect 1/8 of them to be done on the first core, 1/8 on the second core, and so on.
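
To make that concrete, a minimal sketch (not from the article being quoted) of explicitly asking R to spread work over several cores with the built-in parallel package; mclapply() forks worker processes, so it applies to Linux/macOS:

library(parallel)

# Explicitly split 8 independent tasks across 4 worker processes;
# a plain lapply() would run them all on a single core.
results <- mclapply(1:8, function(i) sqrt(i), mc.cores = 4)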


2 Answers

Besides the question of scalability, there is a simple rule: Intel Hyperthreading cores do not help, at least under Windows. So I get 8 with detectCores(), but I never found an improvement when going beyond 4 cores, even with parallel MCMC threads, which in general scale perfectly.

If someone has a case (under Windows) where there is such an improvement from Hyperthreading, please post it.
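
If you want to see how many of the reported cores are physical rather than Hyperthreading cores, a small sketch (detectCores(logical = FALSE) is documented in the parallel package, though on some platforms it can return NA):

library(parallel)

detectCores()                  # logical cores, e.g. 8 with Hyperthreading enabled
detectCores(logical = FALSE)   # physical cores, e.g. 4 (may be NA on some platforms)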

answered by Dieter Menne


Any time you do parallel processing there is some overhead (which can be nontrivial, especially with locking data structures and blocking calls). For small batch jobs, running on a single core or two cores is much faster because you're not paying that overhead.

I don't know the size of your job, but you should probably run some scaling experiments where you time your job on 1 processor, 2 processors, 4 processors, 8 processors, and so on until you hit the max core count for your system (typically, you double the processor count each time). EDIT: It looks like you're only using 4 cores, so time with 1, 2, and 4.

Run about 32 timing trials for each core count and compute a confidence interval; then you can say with confidence whether running on all cores is right for you. If your job takes a long time, reduce the number of trials, down to 5 or so, but remember that more trials will give you a higher degree of confidence.
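
A sketch of such a scaling experiment, assuming a hypothetical convert_file() function standing in for your conversion code and the files list from the question; the trial count is just a placeholder:

library(doMC)

core_counts <- c(1, 2, 4)
n_trials <- 5  # increase toward ~32 if the job is short enough

timings <- lapply(core_counts, function(nc) {
  registerDoMC(cores = nc)
  replicate(n_trials, {
    system.time(
      foreach(f = files) %dopar% convert_file(f)  # convert_file() is hypothetical
    )["elapsed"]
  })
})
names(timings) <- paste0("cores_", core_counts)

sapply(timings, mean)  # mean elapsed seconds per core count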

To elaborate:

Student's t-test:

The Student's t-test essentially says: "You calculated an average time for this core count, but that's not the true average. We could only get the true average if we had an infinite number of data points. The true average actually lies in some interval around your computed average."

The t-test for significance then basically compares the intervals around the true averages for two data sets and says whether they are significantly different or not. So one average time may be less than another, but if the standard deviation is sufficiently high, we can't say for certain that it's actually less; the true averages may be identical.

So, to compute this test for significance:

  • Run your timing experiments.
  • For each core count, compute your mean and standard deviation. The standard deviation should be the population standard deviation, which is the square root of the population variance. The population variance is (1/N) * summation_for_all_data_points((datapoint_i - mean)^2).

Now you will have a mean and a standard deviation for each core count: (m_1, s_1), (m_2, s_2), etc.

  • For every pair of core counts, compute a t-value: t = (m_1 - m_2)/(s_1/ sqrt(#dataPoints))

The example t-value above tests whether the mean timing results for a core count of 1 are significantly different from the timing results for a core count of 2. You could test the other way around by saying:

t = (m_2 - m_1)/(s_2/ sqrt(#dataPoints))
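
A small sketch of those formulas in R, continuing from the hypothetical timings list in the scaling sketch above:

# Elapsed times measured with 1 core and with 2 cores (hypothetical data)
times_1 <- timings[["cores_1"]]
times_2 <- timings[["cores_2"]]

# Population standard deviation as described above (divide by N, not N - 1)
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

m_1 <- mean(times_1); s_1 <- pop_sd(times_1)
m_2 <- mean(times_2); s_2 <- pop_sd(times_2)
n   <- length(times_1)                  # number of trials per core count

t_12 <- (m_1 - m_2) / (s_1 / sqrt(n))   # core count 1 vs core count 2
t_21 <- (m_2 - m_1) / (s_2 / sqrt(n))   # the other direction, as in the text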

After you have computed these t-values, you can tell whether they're significant by looking them up in a critical value table. Before you do that, you need to know about two more things:

Degrees of Freedom

This is related to the number of data points you have. The more data points you have, the smaller the interval around the mean probably is. Degrees of freedom roughly measures your computed mean's ability to move about, and it is #dataPoints - 1 (the v in the critical value table).

Alpha

Alpha is a probability threshold. In the Gaussian (normal, bell-curved) distribution, alpha cuts the bell curve on both the left and the right. Any probability in the middle of the cutoffs falls inside the threshold and is an insignificant result. A lower alpha makes it harder to get a significant result: alpha = 0.01 means only the most extreme 1% of probabilities are significant, and alpha = 0.05 means the most extreme 5%. Most people use alpha = 0.05.

In the critical value table, 1 - alpha determines the column you go down looking for a critical value (so alpha = 0.05 gives 0.95, or a 95% confidence level), and v, your degrees of freedom, gives the row to look at.
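
In R you can compute the critical value directly instead of reading it off a printed table; a sketch, noting that whether you use 1 - alpha or 1 - alpha/2 depends on a one-sided vs two-sided test, which the description above leaves open:

alpha <- 0.05
n     <- 5              # trials per core count (placeholder)
v     <- n - 1          # degrees of freedom, #dataPoints - 1

qt(1 - alpha, df = v)       # one-sided critical value (the 1 - alpha column)
qt(1 - alpha / 2, df = v)   # two-sided critical value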

If your computed t (absolute value) is greater than the critical value, then your result is significant. If your computed t (absolute value) is less than the critical value, then you do NOT have statistical significance.

Edit: The Student's t-test assumes that variances and standard deviations are the same between the two means being compared. That is, it assumes the distribution of data points around the true mean is equal. If you DON'T want to make this assumption, then you're looking for Welch's t-test, which is slightly different. The wiki page has a good formula for computing t-values for this test.
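
For a cross-check of the hand-rolled numbers, R's built-in t.test() performs Welch's t-test by default and the classic Student's t-test when var.equal = TRUE (again using the hypothetical times_1 and times_2 vectors from the sketch above):

# Welch's t-test: does not assume equal variances (the default)
t.test(times_1, times_2)

# Student's t-test: assumes equal variances
t.test(times_1, times_2, var.equal = TRUE)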

answered by AndyG