
Scalable allocation of large (8MB) memory regions on NUMA architectures

We are currently using a TBB flow graph in which a) a parallel filter processes an array (in parallel with offsets) and puts processed results into an intermediate vector (allocated on the heap; mostly the vector will grow up to 8MB). These vectors are then passed to nodes which postprocess these results based on their characteristics (determined in a)). Because of synchronized resources, there can only be one such node for each characteristic. The prototype we wrote works well on UMA architectures (tested on a single CPU Ivy Bridge and Sandy Bridge architecture). However, the application does not scale on our NUMA architecture (4 CPU Nehalem-EX). We pinned the problem down to memory allocation and created a minimal example in which we have a parallel pipeline that just allocates memory from the heap (via malloc of an 8MB chunk, then memset of the 8MB region; similar to what the initial prototype would do) up to a certain amount of memory. Our findings are:

  • On a UMA architecture the application scales up linearly with the number of threads used by the pipeline (set via task_scheduler_init)

  • On the NUMA architecture when we pin the application to one socket (using numactl) we see the same linear scale-up

  • On the NUMA architecture when we use more than one socket, the runtime of our application increases with the number of sockets (negative linear scale-"up")

For us this smells like heap contention. What we tried so far is to substitute Intel's TBB scalable allocator for the glibc allocator. However, the initial performance on a single socket is worse than with glibc; on multiple sockets performance does not get worse, but it does not get any better either. We saw the same effect with tcmalloc, the hoard allocator, and TBB's cache aligned allocator.
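
To make the substitution concrete, here is a small sketch (not from the original post) of the allocation call that changes between the variants; tcmalloc and the hoard allocator are typically swapped in at link or LD_PRELOAD time, so they go through the plain malloc call:

#include <cstdlib>
#include <tbb/scalable_allocator.h>
#include <tbb/cache_aligned_allocator.h>

int main()
{
   const std::size_t chunkSize = 8 * 1024 * 1024; // 8 MB, as in the prototype

   // glibc malloc (tcmalloc/hoard replace this call when linked in or preloaded)
   void* a = std::malloc(chunkSize);

   // TBB scalable allocator
   void* b = scalable_malloc(chunkSize);

   // TBB cache aligned allocator
   char* c = tbb::cache_aligned_allocator<char>().allocate(chunkSize);

   tbb::cache_aligned_allocator<char>().deallocate(c, chunkSize);
   scalable_free(b);
   std::free(a);
   return 0;
}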

The question is whether someone has experienced similar issues. Stack allocation is not an option for us, as we want to keep the heap-allocated vectors even after the pipeline has run. How can one efficiently heap-allocate memory regions of several MB on NUMA architectures from multiple threads? We'd really like to keep a dynamic allocation approach instead of preallocating memory and managing it within the application.

I attached perf stats for the various executions with numactl. Interleaving/localalloc have no effect whatsoever (the QPI bus is not the bottleneck; we verified that with PCM, QPI link load is at 1%). I also added a chart depicting the results for glibc, tbbmalloc, and tcmalloc.

perf stat bin/prototype 598.867

Performance counter stats for 'bin/prototype':

  12965,118733 task-clock                #    7,779 CPUs utilized          
        10.973 context-switches          #    0,846 K/sec                  
         1.045 CPU-migrations            #    0,081 K/sec                  
       284.210 page-faults               #    0,022 M/sec                  
17.266.521.878 cycles                    #    1,332 GHz                     [82,84%]
15.286.104.871 stalled-cycles-frontend   #   88,53% frontend cycles idle    [82,84%]
10.719.958.132 stalled-cycles-backend    #   62,09% backend  cycles idle    [67,65%]
 3.744.397.009 instructions              #    0,22  insns per cycle        
                                         #    4,08  stalled cycles per insn [84,40%]
   745.386.453 branches                  #   57,492 M/sec                   [83,50%]
    26.058.804 branch-misses             #    3,50% of all branches         [83,33%]

   1,666595682 seconds time elapsed

perf stat numactl --cpunodebind=0 bin/prototype 272.614

Performance counter stats for 'numactl --cpunodebind=0 bin/prototype':

   3887,450198 task-clock                #    3,345 CPUs utilized          
         2.360 context-switches          #    0,607 K/sec                  
           208 CPU-migrations            #    0,054 K/sec                  
       282.794 page-faults               #    0,073 M/sec                  
 8.472.475.622 cycles                    #    2,179 GHz                     [83,66%]
 7.405.805.964 stalled-cycles-frontend   #   87,41% frontend cycles idle    [83,80%]
 6.380.684.207 stalled-cycles-backend    #   75,31% backend  cycles idle    [66,90%]
 2.170.702.546 instructions              #    0,26  insns per cycle        
                                         #    3,41  stalled cycles per insn [85,07%]
   430.561.957 branches                  #  110,757 M/sec                   [82,72%]
    16.758.653 branch-misses             #    3,89% of all branches         [83,06%]

   1,162185180 seconds time elapsed

perf stat numactl --cpunodebind=0-1 bin/prototype 356.726

Performance counter stats for 'numactl --cpunodebind=0-1 bin/prototype':

   6127,077466 task-clock                #    4,648 CPUs utilized          
         4.926 context-switches          #    0,804 K/sec                  
           469 CPU-migrations            #    0,077 K/sec                  
       283.291 page-faults               #    0,046 M/sec                  
10.217.787.787 cycles                    #    1,668 GHz                     [82,26%]
 8.944.310.671 stalled-cycles-frontend   #   87,54% frontend cycles idle    [82,54%]
 7.077.541.651 stalled-cycles-backend    #   69,27% backend  cycles idle    [68,59%]
 2.394.846.569 instructions              #    0,23  insns per cycle        
                                         #    3,73  stalled cycles per insn [84,96%]
   471.191.796 branches                  #   76,903 M/sec                   [83,73%]
    19.007.439 branch-misses             #    4,03% of all branches         [83,03%]

   1,318087487 seconds time elapsed

perf stat numactl --cpunodebind=0-2 bin/prototype 472.794

Performance counter stats for 'numactl --cpunodebind=0-2 bin/prototype':

   9671,244269 task-clock                #    6,490 CPUs utilized          
         7.698 context-switches          #    0,796 K/sec                  
           716 CPU-migrations            #    0,074 K/sec                  
       283.933 page-faults               #    0,029 M/sec                  
14.050.655.421 cycles                    #    1,453 GHz                     [83,16%]
12.498.787.039 stalled-cycles-frontend   #   88,96% frontend cycles idle    [83,08%]
 9.386.588.858 stalled-cycles-backend    #   66,81% backend  cycles idle    [66,25%]
 2.834.408.038 instructions              #    0,20  insns per cycle        
                                         #    4,41  stalled cycles per insn [83,44%]
   570.440.458 branches                  #   58,983 M/sec                   [83,72%]
    22.158.938 branch-misses             #    3,88% of all branches         [83,92%]

   1,490160954 seconds time elapsed

Minimal example: compiled with g++-4.7 -std=c++11 -O3 -march=native; executed with numactl --cpunodebind=0 ... numactl --cpunodebind=0-3. With CPU binding we observe the following: 1 CPU (speed x), 2 CPUs (speed ~ x/2), 3 CPUs (speed ~ x/3) [speed = the higher the better]. So performance worsens with the number of CPUs used. Memory binding, interleaving (--interleave=all) and --localalloc have no effect here (we monitored all QPI links and link load was below 1% for each link).

#include <tbb/pipeline.h>
#include <tbb/task_scheduler_init.h>
#include <chrono>
#include <cstring> // memset
#include <stdint.h>
#include <iostream>
#include <fcntl.h>
#include <sstream>
#include <sys/mman.h>
#include <tbb/scalable_allocator.h>
#include <tuple>

namespace {
// 8 MB
size_t chunkSize = 8 * 1024 * 1024;
// Number of threads (0 = automatic)
uint64_t threads=0;
}

using namespace std;
typedef chrono::duration<double, milli> milliseconds;

int main(int /* argc */, char** /* argv */)
{
   chrono::time_point<chrono::high_resolution_clock> startLoadTime = chrono::high_resolution_clock::now();
   tbb::task_scheduler_init init(threads==0?tbb::task_scheduler_init::automatic:threads);
   const uint64_t chunks=128;
   uint64_t nextChunk=0;
   tbb::parallel_pipeline(128,tbb::make_filter<void,uint64_t>(
         // Serial input filter: hands out chunk indices until all chunks are issued
         tbb::filter::serial,[&](tbb::flow_control& fc)->uint64_t
   {
      uint64_t chunk=nextChunk++;
      if(chunk==chunks)
         fc.stop();

      return chunk;
   }) & tbb::make_filter<uint64_t,void>(
         // Parallel filter: allocate an 8MB chunk and touch every page.
         // The buffers are intentionally never freed; the benchmark only
         // measures allocation plus first touch.
         tbb::filter::parallel,[&](uint64_t /* item */)->void
   {
        void* buffer=scalable_malloc(chunkSize);
        memset(buffer,0,chunkSize);
   }));

   chrono::time_point<chrono::high_resolution_clock> endLoadTime = chrono::high_resolution_clock::now();
   milliseconds loadTime = endLoadTime - startLoadTime;
   cout << loadTime.count()<<endl;
}
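
For reference, a build and run invocation consistent with the description above might look like the following; the source/output file names and the TBB link flags are assumptions, not from the original post:

g++-4.7 -std=c++11 -O3 -march=native -o prototype prototype.cpp -ltbb -ltbbmalloc
perf stat numactl --cpunodebind=0 ./prototype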

Discussion on Intel TBB forums: http://software.intel.com/en-us/forums/topic/346334

asked Dec 10 '12 by muehlbau

1 Answer

Second Update (closing the question):

Just profiled the example application again with a 3.10 kernel.

Results for parallel allocation and memsetting of 16GB of data:

small pages:

  • 1 socket: 3112.29 ms
  • 2 sockets: 2965.32 ms
  • 3 sockets: 3000.72 ms
  • 4 sockets: 3211.54 ms

huge pages:

  • 1 socket: 3086.77 ms
  • 2 sockets: 1568.43 ms
  • 3 sockets: 1084.45 ms
  • 4 sockets: 852.697 ms

The scalable allocation problem seems to be fixed now - at least for huge pages.
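
As a point of reference (this is not code from the original post), one way to explicitly back the 8 MB chunks with huge pages on Linux is a hugetlb mapping with a transparent-huge-page fallback; with a recent kernel and transparent huge pages enabled, ordinary heap allocations may also end up on huge pages automatically:

#include <sys/mman.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

static const std::size_t chunkSize = 8 * 1024 * 1024; // 8 MB, as in the example above

// Try an explicit hugetlb mapping first (requires reserved huge pages, e.g. via
// vm.nr_hugepages); fall back to a normal anonymous mapping that is marked as
// eligible for transparent huge pages.
void* allocateChunkHuge()
{
   void* p = mmap(nullptr, chunkSize, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
   if (p == MAP_FAILED) {
      p = mmap(nullptr, chunkSize, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED)
         return nullptr;
      madvise(p, chunkSize, MADV_HUGEPAGE); // hint for transparent huge pages
   }
   std::memset(p, 0, chunkSize); // first touch happens on the calling thread's NUMA node
   return p;
}

int main()
{
   void* p = allocateChunkHuge();
   std::printf("%s\n", p ? "allocated one 8 MB chunk" : "allocation failed");
   if (p)
      munmap(p, chunkSize);
   return 0;
}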

answered by muehlbau