
TBB acting strange in Matlab Mex file

Edit: this question differs from "Matlab limits TBB but not OpenMP" and is not a duplicate, even though it uses the same sample code for illustration. In my case I pass the number of threads explicitly when initializing TBB instead of using "deferred". I am also asking about the difference in behavior between TBB in a standalone C++ program and TBB inside a MEX file. The answer to that question only demonstrates thread initialization when running TBB in C++, not in MEX.


I'm trying to speed up a Matlab MEX file. The strange thing I came across is that, inside a MEX file, TBB initialization doesn't work as expected.

This C++ program reaches 100% CPU usage and runs 15 TBB threads when executed on its own:

main.cpp

#include "tbb/parallel_for_each.h"
#include "tbb/task_scheduler_init.h"
#include <iostream>
#include <vector>
#include "mex.h"

struct mytask {
  mytask(size_t n)
    :_n(n)
  {}
  void operator()() {
    for (long long i=0;i<10000000000LL;++i) {}  // Deliberately run slow (long long: long is 32-bit on 64-bit Windows)
    std::cerr << "[" << _n << "]";
  }
  size_t _n;
};

template <typename T> struct invoker {
  void operator()(T& it) const {it();}
};

void mexFunction(/* int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[] */) {

  tbb::task_scheduler_init init(15);  // 15 threads

  std::vector<mytask> tasks;
  for (int i=0;i<10000;++i)
    tasks.push_back(mytask(i));

  tbb::parallel_for_each(tasks.begin(),tasks.end(),invoker<mytask>());

}

int main()
{
    mexFunction();
}

Then I modified the code slightly to build a MEX file for Matlab:

BuildMEX.mexw64

#include "tbb/parallel_for_each.h"
#include "tbb/task_scheduler_init.h"
#include <iostream>
#include <vector>
#include "mex.h"

struct mytask {
  mytask(size_t n)
    :_n(n)
  {}
  void operator()() {
    for (long long i=0;i<10000000000LL;++i) {}  // Deliberately run slow (long long: long is 32-bit on 64-bit Windows)
    std::cerr << "[" << _n << "]";
  }
  size_t _n;
};

template <typename T> struct invoker {
  void operator()(T& it) const {it();}
};


void mexFunction( int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[] ) {

  tbb::task_scheduler_init init(15);  // 15 threads

  std::vector<mytask> tasks;
  for (int i=0;i<10000;++i)
    tasks.push_back(mytask(i));

  tbb::parallel_for_each(tasks.begin(),tasks.end(),invoker<mytask>());

}

Finally, I invoke BuildMEX.mexw64 from Matlab. I compiled the following snippet with mcc into a Matlab binary, "MEXtest.exe", and profiled it with VTune while it ran in the MCR. Within that process TBB initialized only 4 threads, and the binary used only ~50% of the CPU. Why does running inside a MEX file limit TBB and reduce overall performance? How can I get the MEX file to use more of the CPU?

MEXtest.exe

function MEXtest()

BuildMEX();

end

asked Oct 21 '22 06:10 by yfeng


1 Answer

According to the scheduler class description:

This class allows to customize properties of the TBB task pool to some extent. For example it can limit concurrency level of parallel work initiated by the given thread. It also can be used to specify stack size of the TBB worker threads, though this setting is not effective if the thread pool has already been created.

This is further explained in the initialize() methods called by the constructor:

The number_of_threads is ignored if any other task_scheduler_inits currently exist. A thread may construct multiple task_scheduler_inits. Doing so does no harm because the underlying scheduler is reference counted.

(highlighted parts added by me)

I believe that MATLAB already uses Intel TBB internally, and it must have initialized a thread pool at a top level before the MEX-function is ever executed. Thus all task schedulers in your code are going to use the number of threads specified by internal parts of MATLAB, ignoring the value you specified in your code.
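
To make the reference-counting rule concrete, here is a minimal toy model in plain C++. It is not TBB code and does not reflect how TBB is implemented; ToyScheduler and ToyInit are hypothetical names made up for this sketch, and the model is single-threaded for simplicity. It only illustrates why a thread count requested while another initializer is still alive has no effect:

```cpp
#include <cassert>

// Toy model (NOT TBB code): a process-wide pool whose size is fixed by the
// first live initializer and torn down when the last one goes away.
// Not thread-safe; intended for illustration only.
class ToyScheduler {
public:
    // Registers one more user and returns the pool size now in effect.
    static int acquire(int requested_threads) {
        assert(requested_threads > 0);
        if (ref_count_ == 0) {
            pool_size_ = requested_threads;  // only the first request decides
        }
        ++ref_count_;                        // later requests just add a reference
        return pool_size_;
    }

    // Drops one reference; the pool disappears with the last one.
    static void release() {
        assert(ref_count_ > 0);
        if (--ref_count_ == 0) {
            pool_size_ = 0;
        }
    }

    static int pool_size() { return pool_size_; }

private:
    static int ref_count_;
    static int pool_size_;
};

int ToyScheduler::ref_count_ = 0;
int ToyScheduler::pool_size_ = 0;

// RAII handle, loosely mirroring how a scheduler-init object is used.
class ToyInit {
public:
    explicit ToyInit(int requested_threads)
        : effective_(ToyScheduler::acquire(requested_threads)) {}
    ~ToyInit() { ToyScheduler::release(); }
    ToyInit(const ToyInit&) = delete;
    ToyInit& operator=(const ToyInit&) = delete;

    // Thread count actually in effect, which may differ from the request.
    int effective_threads() const { return effective_; }

private:
    int effective_;
};
```

In this model, if a host holds a ToyInit(4) for its whole lifetime, a later ToyInit(15) reports 4 effective threads. Constructed on its own, the same ToyInit(15) reports 15, which is consistent with the standalone program getting the threads it asked for.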

By default, MATLAB has presumably initialized the thread pool with a size equal to the number of physical cores (not logical processors), as suggested by the fact that on my quad-core hyper-threaded machine I get:

>> maxNumCompThreads
Warning: maxNumCompThreads will be removed in a future release [...]
ans =
     4

OpenMP, on the other hand, has no scheduler, and we can control the number of threads at runtime by calling the following functions:

#include <omp.h>
.. 
omp_set_dynamic(1);
omp_set_num_threads(omp_get_num_procs());

or by setting the environment variable:

>> setenv('OMP_NUM_THREADS', '8')

To test this proposed explanation, here is the code I used:

test_tbb.cpp

#ifdef MATLAB_MEX_FILE
#include "mex.h"
#endif

#include <cmath>    // sin
#include <cstdlib>
#include <cstdio>
#include <vector>

#define WIN32_LEAN_AND_MEAN
#include <windows.h>

#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for_each.h"
#include "tbb/spin_mutex.h"

#include "tbb_helpers.hxx"

#define NTASKS 100
#define NLOOPS 400000L

tbb::spin_mutex print_mutex;

struct mytask {
    mytask(size_t n) :_n(n) {}
    void operator()()
    {
        // track maximum number of parallel workers run
        ConcurrencyProfiler prof;

        // burn some CPU cycles!
        double x = 1.0 / _n;
        for (long i=0; i<NLOOPS; ++i) {
            x = sin(x) * 10.0;
            while((double) rand() / RAND_MAX < 0.9);
        }
        {
            tbb::spin_mutex::scoped_lock s(print_mutex);
            fprintf(stderr, "%f\n", x);
        }
    }
    size_t _n;
};

template <typename T> struct invoker {
    void operator()(T& it) const { it(); }
};

void run()
{
    // use all 8 logical cores
    SetProcessAffinityMask(GetCurrentProcess(), 0xFF);

    printf("numTasks = %d\n", NTASKS);
    for (int t = tbb::task_scheduler_init::automatic;
         t <= 512; t = (t>0) ? t*2 : 1)
    {
        tbb::task_scheduler_init init(t);

        std::vector<mytask> tasks;
        for (int i=0; i<NTASKS; ++i) {
            tasks.push_back(mytask(i));
        }

        ConcurrencyProfiler::Reset();
        tbb::parallel_for_each(tasks.begin(), tasks.end(), invoker<mytask>());

        printf("pool_init(%d) -> %d worker threads\n", t,
            ConcurrencyProfiler::GetMaxNumThreads());
    }
}

#ifdef MATLAB_MEX_FILE
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
    run();
}
#else
int main()
{
    run();
    return 0;
}
#endif

Here is the code for a simple helper class used to profile concurrency by keeping track of how many workers were invoked from the thread pool. You could always use Intel VTune or any other profiling tool to get the same kind of information:

tbb_helpers.hxx

#ifndef HELPERS_H
#define HELPERS_H

#include "tbb/atomic.h"

class ConcurrencyProfiler
{
public:
    ConcurrencyProfiler();
    ~ConcurrencyProfiler();
    static void Reset();
    static size_t GetMaxNumThreads();
private:
    static void RecordMax();
    static tbb::atomic<size_t> cur_count;
    static tbb::atomic<size_t> max_count;
};

#endif

tbb_helpers.cxx

#include "tbb_helpers.hxx"

tbb::atomic<size_t> ConcurrencyProfiler::cur_count;
tbb::atomic<size_t> ConcurrencyProfiler::max_count;

ConcurrencyProfiler::ConcurrencyProfiler()
{
    ++cur_count;
    RecordMax();
}

ConcurrencyProfiler::~ConcurrencyProfiler()
{
    --cur_count;
}

void ConcurrencyProfiler::Reset()
{
    cur_count = max_count = 0;
}

size_t ConcurrencyProfiler::GetMaxNumThreads()
{
    return static_cast<size_t>(max_count);
}

// Performs: max_count = max(max_count,cur_count)
// http://www.threadingbuildingblocks.org/
//    docs/help/tbb_userguide/Design_Patterns/Compare_and_Swap_Loop.htm
void ConcurrencyProfiler::RecordMax()
{
    size_t o;
    do {
        o = max_count;
        if (o >= cur_count) break;
    } while(max_count.compare_and_swap(cur_count,o) != o);
}
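
For readers without tbb::atomic, the same compare-and-swap loop can be sketched with std::atomic from the standard library. This is an assumed equivalent rather than part of the code I tested, and atomic_store_max is a hypothetical helper name:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Raises `max_value` to at least `candidate` and returns the maximum this
// call ended up with. Performs: max_value = max(max_value, candidate).
inline std::size_t atomic_store_max(std::atomic<std::size_t>& max_value,
                                    std::size_t candidate) {
    std::size_t observed = max_value.load();
    // Retry while our candidate is still larger than the stored value.
    // When compare_exchange_weak fails, it refreshes `observed` with the
    // value currently stored, so the next comparison uses fresh data.
    while (observed < candidate &&
           !max_value.compare_exchange_weak(observed, candidate)) {
    }
    const std::size_t result = (observed < candidate) ? candidate : observed;
    assert(result >= candidate);  // the returned maximum never undercuts the candidate
    return result;
}
```

A constructor like the one in ConcurrencyProfiler could then call atomic_store_max(max_count, ++cur_count), provided both counters are declared as std::atomic<std::size_t>.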

First I compile the code as a native executable (I am using Intel C++ Composer XE 2013 SP1, with VS2012 Update 4):

C:\> vcvarsall.bat amd64
C:\> iclvars.bat intel64 vs2012
C:\> icl /MD test_tbb.cpp tbb_helpers.cxx tbb.lib

I run the program in the system shell (Windows 8.1). It goes up to 100% CPU utilization and I get the following output:

C:\> test_tbb.exe 2> nul
numTasks = 100
pool_init(-1) -> 8 worker threads          // task_scheduler_init::automatic
pool_init(1) -> 1 worker threads
pool_init(2) -> 2 worker threads
pool_init(4) -> 4 worker threads
pool_init(8) -> 8 worker threads
pool_init(16) -> 16 worker threads
pool_init(32) -> 32 worker threads
pool_init(64) -> 64 worker threads
pool_init(128) -> 98 worker threads
pool_init(256) -> 100 worker threads
pool_init(512) -> 98 worker threads

As expected, the thread pool is initialized with as many threads as we asked for, and it is fully utilized up to the limit imposed by the number of tasks we created (in the last case we have 512 threads for only 100 parallel tasks!).
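
The numbers above respect a simple upper bound: the profiler can never see more concurrent workers than the pool size or the number of tasks, whichever is smaller. Here is a small sketch of that bound, together with the pool sizes the loop in run() steps through. The function names are hypothetical, and both the value -1 for task_scheduler_init::automatic and the automatic default of 8 are simply read off the output above:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Value printed for tbb::task_scheduler_init::automatic in the output above.
const int kAutomatic = -1;

// Pool sizes visited by the loop in run(): automatic, then 1, 2, 4, ..., 512.
inline std::vector<int> pool_sizes_to_test() {
    std::vector<int> sizes;
    for (int t = kAutomatic; t <= 512; t = (t > 0) ? t * 2 : 1) {
        sizes.push_back(t);
    }
    return sizes;
}

// Upper bound on the concurrency the profiler can report for one run:
// no more workers than the pool offers, and no more than there are tasks.
inline int concurrency_upper_bound(int requested, int automatic_default,
                                   int num_tasks) {
    assert(automatic_default > 0 && num_tasks >= 0);
    const int pool = (requested == kAutomatic) ? automatic_default : requested;
    return std::min(pool, num_tasks);
}
```

For 100 tasks this gives 8 for the automatic setting, 64 for a request of 64, and 100 for a request of 512. The measured values of 98 and 100 for the largest pools stay within this bound.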

Next I compile the code as a MEX-file:

>> mex -I"C:\Program Files (x86)\Intel\Composer XE\tbb\include" ...
   -largeArrayDims test_tbb.cpp tbb_helpers.cxx ...
   -L"C:\Program Files (x86)\Intel\Composer XE\tbb\lib\intel64\vc11" tbb.lib

Here is the output I get when I run the MEX-function in MATLAB:

>> test_tbb()
numTasks = 100
pool_init(-1) -> 4 worker threads
pool_init(1) -> 4 worker threads
pool_init(2) -> 4 worker threads
pool_init(4) -> 4 worker threads
pool_init(8) -> 4 worker threads
pool_init(16) -> 4 worker threads
pool_init(32) -> 4 worker threads
pool_init(64) -> 4 worker threads
pool_init(128) -> 4 worker threads
pool_init(256) -> 4 worker threads
pool_init(512) -> 4 worker threads

As you can see, no matter what we specify as the pool size, the scheduler always spins up at most 4 threads to execute the parallel tasks (4 being the number of physical cores on my quad-core machine). This confirms what I stated at the beginning of the post.

Note that I explicitly set the processor affinity mask to use all 8 cores, but since there are only 4 running threads, CPU usage stayed approximately at 50% in this case.
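
The 50% figure is what simple arithmetic predicts. As a rough sketch, assuming compute-bound threads that each keep one logical processor busy and nothing else running on the machine (the function name is hypothetical):

```cpp
#include <cassert>

// Upper bound on whole-machine CPU utilization, in percent, when
// `busy_threads` compute-bound threads run on `logical_cpus` processors.
// Threads beyond the number of logical processors cannot add utilization.
inline double max_cpu_utilization_percent(int busy_threads, int logical_cpus) {
    assert(busy_threads >= 0 && logical_cpus > 0);
    const int running = (busy_threads < logical_cpus) ? busy_threads
                                                      : logical_cpus;
    return 100.0 * running / logical_cpus;
}
```

max_cpu_utilization_percent(4, 8) gives 50, matching what I saw in MATLAB, while any thread count of 8 or more saturates the 8 logical processors.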

Hope this helps answer the question, and sorry for the long post :)

answered Oct 23 '22 09:10 by Amro