Parallel computing memory access bottleneck

The following algorithm is run iteratively in my program. Running it without the two lines indicated below takes 1.5X as long as not running it at all, which is already surprising to me. Worse, however, is that running it with those two lines increases completion time to 4.4X that of running without them (6.6X that of not running the whole algorithm). Additionally, it causes my program to fail to scale beyond ~8 cores. In fact, when run on a single core, the two lines only increase the time to 1.7X, which is still way too high considering what they do. I've ruled out that it has to do with an effect of the modified data elsewhere in my program.

So I'm wondering what could be causing this. Something to do with the cache maybe?

void NetClass::Age_Increment(vector <synapse> & synapses, int k)  
{
    int size = synapses.size();
    int target = -1;

    if(k > -1)
    {
        for(int q=0, x=0 ; q < size; q++)
        {
            if(synapses[q].active)
                synapses[q].age++;
            else
            {
                if(x==k)target=q;
                x++;
            }
        }
        /////////////////////////////////////Causing Bottleneck/////////////
        synapses[target].active = true;
        synapses[target].weight = .04 + (float (rand_r(seedp) % 17) / 100);
        ////////////////////////////////////////////////////////////////////
    }

    else
    {
        for(int q=0 ; q < size; q++)
            if(synapses[q].active)
                synapses[q].age++;
    }
}

Update: Changing the two problem lines to:

bool x = true;
float y = .04 + (float (rand_r(seedp) % 17) / 100);

removes the problem, which suggests it's something to do with memory access?

asked Nov 27 '11 by Matt Munson



2 Answers

Each thread modifies memory that all the other threads read:

for(int q=0, x=0 ; q < size; q++)
   if(synapses[q].active) ... // ALL threads read EVERY synapse.active
...
synapses[target].active = true; // EVERY thread writes at least one synapse.active

Reads and writes of the same addresses from different threads cause a great deal of cache invalidation, which results in exactly the symptoms you describe. The solution is to avoid the write inside the loop, and the fact that moving the writes into local variables removes the problem is, again, proof that the problem is cache invalidation. Note that even if you didn't write the same field being read (active), you would likely see the same symptoms due to false sharing, as I suspect that active, age and weight share a cache line.

For more details see CPU Caches and Why You Care
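
To make the layout concern concrete, here is a minimal sketch assuming a plausible definition of synapse (the real one is not shown in the question): with a bool, an int and a float the struct occupies roughly 12 bytes, so five or more elements pack into a single 64-byte cache line, and one thread's write to synapses[target].active dirties a line that several other threads are reading.

// Hypothetical layout for the asker's synapse type; the actual definition
// is not shown in the question. With these three fields the struct is
// about 12 bytes, so several consecutive elements of the vector share one
// 64-byte cache line.
struct synapse {
    bool  active;   // read by every thread in the loop
    int   age;      // incremented for active synapses
    float weight;   // written when a synapse is activated
};

// If this holds, a write to any field of one element invalidates the cache
// line holding its neighbours on every other core that has that line cached.
static_assert(sizeof(synapse) <= 16,
              "several synapse objects fit in one 64-byte cache line");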

A final note is that the assignments to active and weight, not to mention the age++ increment, all look extremely thread-unsafe. Interlocked operations or lock/mutex protection for such updates would be mandatory.
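
As a rough illustration of that last point, here is a minimal sketch of one way to guard those updates, assuming the synapse type can be changed. std::atomic makes the age increment safe, and a (hypothetical) mutex covers the two-field activation so other threads never observe a half-written element; it is a sketch of the idea, not a drop-in replacement for the asker's code.

#include <atomic>
#include <mutex>
#include <vector>

struct synapse {
    std::atomic<bool> active{false};
    std::atomic<int>  age{0};
    float             weight = 0.0f;
};

std::mutex activation_mutex;   // hypothetical: one lock for the rare activation path

void age_active(std::vector<synapse>& synapses)
{
    for (synapse& s : synapses)
        if (s.active.load(std::memory_order_relaxed))
            s.age.fetch_add(1, std::memory_order_relaxed);   // safe concurrent increment
}

void activate(std::vector<synapse>& synapses, int target, float new_weight)
{
    std::lock_guard<std::mutex> lock(activation_mutex);
    synapses[target].weight = new_weight;
    synapses[target].active.store(true, std::memory_order_release);
}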

answered Oct 26 '22 by Remus Rusanu

Try re-introducing these two lines, but without rand_r, just to see if you get the same performance deterioration. If you don't, this is probably a sign that the rand_r is internally serialized (e.g. through a mutex), so you'd need to find a way to generate random numbers more concurrently.
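
If rand_r does turn out to be the limiting factor, one hedged alternative is to give every thread its own C++11 generator via thread_local, so no random-number state is shared at all. The 0-16 range below mirrors the % 17 in the original expression; the helper name is made up for the example.

#include <random>

// Hypothetical helper: each thread keeps its own engine, so calls from
// different threads never touch shared state.
float random_weight()
{
    thread_local std::mt19937 gen(std::random_device{}());
    thread_local std::uniform_int_distribution<int> dist(0, 16);  // mirrors rand_r(...) % 17
    return 0.04f + static_cast<float>(dist(gen)) / 100.0f;
}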

The other potential area of concern is false sharing (if you have time, take a look at Herb Sutter's video and slides treating this subject, among others). Essentially, if your threads happen to modify different memory locations that are close enough to fall into the same cache line, the cache-coherency hardware may effectively serialize the memory accesses and destroy the scalability. What makes this hard to diagnose is that these memory locations may be logically independent, so it may not be intuitively obvious that they end up close together at run time. Try adding some padding to split such memory locations apart if you suspect false sharing.
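
For the padding idea, a minimal sketch assuming a 64-byte cache line: alignas forces every element onto its own line, so a write by one thread can no longer invalidate a line another thread is reading. The cost is more memory and poorer cache utilisation in the read-only loop, so it is worth measuring both variants.

// Each element now occupies its own 64-byte cache line (assuming that is
// the line size on the target CPU), which removes false sharing between
// neighbouring elements at the cost of ~5x the memory.
struct alignas(64) padded_synapse {
    bool  active;
    int   age;
    float weight;
};

static_assert(sizeof(padded_synapse) == 64, "one element per cache line");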

answered Oct 26 '22 by Branko Dimitrijevic