I ran the following MATLAB code:
rng(1)
matrix_size = 200;
iterations = 100000;
A = rand(matrix_size);
B = rand(matrix_size);
profile on
for i = 1:iterations
    A * B;
end
profile off
On my MacBook Air (Intel(R) Core(TM) i5-4260U CPU @ 1.40GHz), this takes 39s. On a workstation (Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz), it takes 62s. I did not specify -singleCompThread. The workstation has 12 cores, but 5 single-threaded processes were running, so I had (almost) 7 cores to myself, and they were maxed out the whole time.
How can this be?
When running the above code with -singleCompThread, it completes in 54s.
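For reference, a sketch of two ways to restrict MATLAB to one computational thread (the 54s timing used the startup flag; maxNumCompThreads is an alternative for testing inside an already running session):

% Option 1: start MATLAB with one computational thread
%   matlab -singleCompThread
% Option 2: limit threads from within a running session
oldN = maxNumCompThreads(1);   % set the limit to 1; returns the previous value
% ... run the benchmark ...
maxNumCompThreads(oldN);       % restore the previous thread count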
Quoting a MathWorks support team post:
As of MATLAB 7.4 (R2007a), MATLAB supports multithreaded computation for a number of functions and expressions that are combinations of element-wise functions (e.g. y=4*x*(sin(x) + x^3)). These functions automatically execute on multiple threads and you do not need to explicitly specify commands to create threads in your code.
For a function or expression to execute faster (speed up) on multiple cores, the following conditions must be true:
1) The operations in the algorithm carried out by the function are easily partitioned into sections that can be executed concurrently, and with little communication or few sequential operations required. This is the case for all element-wise operations.
2) The data size is large enough so that any advantages of concurrent execution outweigh the time required to partition the data and manage separate execution threads. For example, most functions speed up only when the array is greater than several thousand elements.
3) The operation is not memory-bound where the processing time is dominated by memory access time, as is the case for simple operations such as element-wise addition. As a general rule, more complex functions speed up better than simple functions.
Your case does not fulfill 2. or 3. Multiplication is extremely fast and simple and is memory bound, and your matrices are relatively small. The multithreading therefore appears to add more overhead than it saves, as your test with -singleCompThread suggests. You could rerun the benchmark with a larger matrix and see whether the difference changes. You could also run the benchmark on the MacBook Air with -singleCompThread to see whether the relative single-thread performance falls into the expected range.
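A minimal sketch of such a size sweep, comparing the default thread count against a single thread (the sizes and the use of timeit/maxNumCompThreads are my choice, not part of your original benchmark):

% Compare multi- vs. single-threaded A*B across matrix sizes (sketch).
sizes = [200 500 1000 2000];
for n = sizes
    A = rand(n);
    B = rand(n);
    t_multi = timeit(@() A*B);       % default (multithreaded) setting
    oldN = maxNumCompThreads(1);     % restrict to one computational thread
    t_single = timeit(@() A*B);
    maxNumCompThreads(oldN);         % restore the previous setting
    fprintf('n = %4d: multi %.5f s, single %.5f s, single/multi %.2f\n', ...
        n, t_multi, t_single, t_single / t_multi);
end

If the single/multi ratio climbs above 1 only for the larger sizes, that supports the overhead explanation.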
Another (partial) explanation could be differences in vector instruction support between the two microarchitectures (the i5-4260U is Haswell, the E5-2687W v4 is Broadwell), e.g. AVX2. I'd do the benchmarks first before looking into that, though.
Also note that the MATLAB profiler turns off the JIT compiler, so the results you get might not be comparable to whatever real case you are benchmarking against.
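If you only need wall-clock time rather than a per-line profile, tic/toc avoids the profiler entirely; a minimal sketch of the same loop:

% Time the loop with tic/toc instead of the profiler, so the JIT stays enabled.
A = rand(200);
B = rand(200);
tic
for i = 1:100000
    A * B;   % same operation as in the profiled loop
end
toc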