Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

TParallel.For performance

Given the following simple task of finding odd numbers in a one dimensional array:

begin
  odds := 0;
  Ticks := TThread.GetTickCount;
  for i := 0 to MaxArr-1 do
      if ArrXY[i] mod 2 = 0 then
        Inc(odds);
  Ticks := TThread.GetTickCount - Ticks;
  writeln('Serial: ' + Ticks.ToString + 'ms, odds: ' + odds.ToString);
end;

It looks like this would be a good candidate for parallel processing. So one might be tempted to use the following TParallel.For version:

begin
  odds := 0;
  Ticks := TThread.GetTickCount;
  TParallel.For(0,  MaxArr-1, procedure(I:Integer)
  begin
    if ArrXY[i] mod 2 = 0 then
      inc(odds);
  end);
  Ticks := TThread.GetTickCount - Ticks;
  writeln('Parallel - false odds: ' + Ticks.ToString + 'ms, odds: ' + odds.ToString);
end;

The result of this parallel computation is somewhat surprising in two respects:

  1. The number of counted odds is wrong

  2. The execution time is longer than in the serial version

1) Is explainable, because we did not protect the odds variable for concurrent access. So in order to fix this, we should use TInterlocked.Increment(odds); instead.

2) Is also explainable: It exhibits the effects of false sharing.

Ideally the solution to the false sharing problem would be to use a local variable to store intermediate results and only at the end of all parallel tasks sum up those intermediaries. And here is my real question that I cannot get my head around: Is there any way to get a local variable into my anonymous method? Note, that simply declaring a local variable within the anonymous method body would not work, as the anonymous method body is called for each iteration. And if that is somehow doable, would there be a way to get my intermediate result at the end of each task iteration out of the anonymous method?

Edit: I am actually not really interested in counting odds, or evans. I only use this to demonstrate the effect.

And for completeness reasons here is a console app demonstrating the effects:

program Project4;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  System.SysUtils, System.Threading, System.Classes, System.SyncObjs;

const
  MaxArr = 100000000;

var
  Ticks: Cardinal;
  i: Integer;
  odds: Integer;
  ArrXY: array of Integer;

procedure FillArray;
var
  i: Integer;
  j: Integer;
begin
  SetLength(ArrXY, MaxArr);
  for i := 0 to MaxArr-1 do
      ArrXY[i]:=Random(MaxInt);
end;

procedure Parallel;
begin
  odds := 0;
  Ticks := TThread.GetTickCount;
  TParallel.For(0,  MaxArr-1, procedure(I:Integer)
  begin
    if ArrXY[i] mod 2 = 0 then
      TInterlocked.Increment(odds);
  end);
  Ticks := TThread.GetTickCount - Ticks;
  writeln('Parallel: ' + Ticks.ToString + 'ms, odds: ' + odds.ToString);
end;

procedure ParallelFalseResult;
begin
  odds := 0;
  Ticks := TThread.GetTickCount;
  TParallel.For(0,  MaxArr-1, procedure(I:Integer)
  begin
    if ArrXY[i] mod 2 = 0 then
      inc(odds);
  end);
  Ticks := TThread.GetTickCount - Ticks;
  writeln('Parallel - false odds: ' + Ticks.ToString + 'ms, odds: ' + odds.ToString);
end;

procedure Serial;
begin
  odds := 0;
  Ticks := TThread.GetTickCount;
  for i := 0 to MaxArr-1 do
      if ArrXY[i] mod 2 = 0 then
        Inc(odds);
  Ticks := TThread.GetTickCount - Ticks;
  writeln('Serial: ' + Ticks.ToString + 'ms, odds: ' + odds.ToString);
end;

begin
  try
    FillArray;
    Serial;
    ParallelFalseResult;
    Parallel;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
  Readln;
end.
like image 344
iamjoosy Avatar asked Dec 17 '14 21:12

iamjoosy


People also ask

What is parallel processing performance?

Parallel processing is a method in computing of running two or more processors (CPUs) to handle separate parts of an overall task. Breaking up different parts of a task among multiple processors will help reduce the amount of time to run a program.

What is parallel processing example?

In parallel processing, we take in multiple different forms of information at the same time. This is especially important in vision. For example, when you see a bus coming towards you, you see its color, shape, depth, and motion all at once. If you had to assess those things one at a time, it would take far too long.

What is parallel processing good for?

Benefits of parallel computing. The advantages of parallel computing are that computers can execute code more efficiently, which can save time and money by sorting through “big data” faster than ever. Parallel programming can also solve more complex problems, bringing more resources to the table.

How do you measure parallel computing performance?

There are many ways to measure the performance of a parallel algorithm running on a parallel processor. The most commonly used measurements are the elapsed time, price/performance, the speed-up, and the efficiency.


2 Answers

The key for this problem is correct partitioning and sharing as little as possible.

With this code it runs almost 4 times faster than the serial one.

const 
  WorkerCount = 4;

function GetWorker(index: Integer; const oddsArr: TArray<Integer>): TProc;
var
  min, max: Integer;
begin
  min := MaxArr div WorkerCount * index;
  if index + 1 < WorkerCount then
    max := MaxArr div WorkerCount * (index + 1) - 1
  else
    max := MaxArr - 1;
  Result :=
    procedure
    var
      i: Integer;
      odds: Integer;
    begin
      odds := 0;
      for i := min to max do
        if Odd(ArrXY[i]) then
          Inc(odds);
      oddsArr[index] := odds;
    end;
end;

procedure Parallel;
var
  i: Integer;
  oddsArr: TArray<Integer>;
  workers: TArray<ITask>;
begin
  odds := 0;
  Ticks := TThread.GetTickCount;
  SetLength(oddsArr, WorkerCount);
  SetLength(workers, WorkerCount);

  for i := 0 to WorkerCount-1 do
    workers[i] := TTask.Run(GetWorker(i, oddsArr));
  TTask.WaitForAll(workers);

  for i := 0 to WorkerCount-1 do
    Inc(odds, oddsArr[i]);
  Ticks := TThread.GetTickCount - Ticks;
  writeln('Parallel: ' + Ticks.ToString + 'ms, odds: ' + odds.ToString);
end;

You can write similar code with the TParallel.For but it still runs a bit slower (like 3 times faster than serial) than just using TTask.

Btw I used the function to return the worker TProc to get the index capturing right. If you run it in a loop in the same routine you capture the loop variable.

Update 19.12.2014:

Since we found out the critical thing is correct partitioning this can be put into a parallel for loop really easily without locking it on a particular data structure:

procedure ParallelFor(lowInclusive, highInclusive: Integer;
  const iteratorRangeEvent: TProc<Integer, Integer>);

  procedure CalcPartBounds(low, high, count, index: Integer;
    out min, max: Integer);
  var
    len: Integer;
  begin
    len := high - low + 1;
    min := (len div count) * index;
    if index + 1 < count then
      max := len div count * (index + 1) - 1
    else
      max := len - 1;
  end;

  function GetWorker(const iteratorRangeEvent: TProc<Integer, Integer>;
    min, max: Integer): ITask;
  begin
    Result := TTask.Run(
      procedure
      begin
        iteratorRangeEvent(min, max);
      end)
  end;

var
  workerCount: Integer;
  workers: TArray<ITask>;
  i, min, max: Integer;
begin
  workerCount := TThread.ProcessorCount;
  SetLength(workers, workerCount);
  for i := 0 to workerCount - 1 do
  begin
    CalcPartBounds(lowInclusive, highInclusive, workerCount, i, min, max);
    workers[i] := GetWorker(iteratorRangeEvent, min, max);
  end;
  TTask.WaitForAll(workers);
end;

procedure Parallel4;
begin
  odds := 0;
  Ticks := TThread.GetTickCount;
  ParallelFor(0, MaxArr-1,
    procedure(min, max: Integer)
    var
      i, n: Integer;
    begin
      n := 0;
      for i := min to max do
        if Odd(ArrXY[i]) then
          Inc(n);
      AtomicIncrement(odds, n);
    end);
  Ticks := TThread.GetTickCount - Ticks;
  writeln('ParallelEx: Stefan Glienke ' + Ticks.ToString + ' ms, odds: ' + odds.ToString);
end;

The key thing is to use a local variable for the counting and only at the end use the shared variable one time to add the sub total.

like image 160
Stefan Glienke Avatar answered Nov 15 '22 22:11

Stefan Glienke


With OmniThreadLibrary from the SVN (this is not yet including in any official release), you can write this in a way which doesn't require interlocked access to the shared counter.

function CountParallelOTL: integer;
var
  counters: array of integer;
  numCores: integer;
  i: integer;
begin
  numCores := Environment.Process.Affinity.Count;
  SetLength(counters, numCores);
  FillChar(counters[0], Length(counters) * SizeOf(counters[0]), 0);

  Parallel.For(0, MaxArr - 1)
    .NumTasks(numCores)
    .Execute(
      procedure(taskIndex, value: integer)
      begin
        if Odd(ArrXY[value]) then
          Inc(counters[taskIndex]);
      end);

  Result := counters[0];
  for i := 1 to numCores - 1 do
    Inc(Result, counters[i]);
end;

This, however, is still at best on par with the sequential loop and at worst a few times slower.

I have compared this with Stefan's solution (XE7 tasks) and with a simple XE7 Parallel.For with interlocked increment (XE7 for).

Results from my notebook with 4 hyperthreaded cores:

Serial: 49999640 odd elements found in 543 ms

Parallel (OTL): 49999640 odd elements found in 555 ms

Parallel (XE7 tasks): 49999640 odd elements found in 136 ms

Parallel (XE7 for): 49999640 odd elements found in 1667 ms

Results from my workstation with 12 hyperthreaded cores:

Serial: 50005291 odd elements found in 685 ms

Parallel (OTL): 50005291 odd elements found in 1309 ms

Parallel (XE7 tasks): 50005291 odd elements found in 62 ms

Parallel (XE7 for): 50005291 odd elements found in 3379 ms

There's a big improvement over System.Threading Paralell.For because there's no interlocked increment but the handcrafted solution is much much faster.

Full test program:

program ParallelCount;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  System.SyncObjs,
  System.Classes,
  System.SysUtils,
  System.Threading,
  DSiWin32,
  OtlCommon,
  OtlParallel;

const
  MaxArr = 100000000;

var
  Ticks: Cardinal;
  i: Integer;
  odds: Integer;
  ArrXY: array of Integer;

procedure FillArray;
var
  i: Integer;
  j: Integer;
begin
  SetLength(ArrXY, MaxArr);
  for i := 0 to MaxArr-1 do
    ArrXY[i]:=Random(MaxInt);
end;

function CountSerial: integer;
var
  odds: integer;
begin
  odds := 0;
  for i := 0 to MaxArr-1 do
      if Odd(ArrXY[i]) then
        Inc(odds);
  Result := odds;
end;

function CountParallelOTL: integer;
var
  counters: array of integer;
  numCores: integer;
  i: integer;
begin
  numCores := Environment.Process.Affinity.Count;
  SetLength(counters, numCores);
  FillChar(counters[0], Length(counters) * SizeOf(counters[0]), 0);

  Parallel.For(0, MaxArr - 1)
    .NumTasks(numCores)
    .Execute(
      procedure(taskIndex, value: integer)
      begin
        if Odd(ArrXY[value]) then
          Inc(counters[taskIndex]);
      end);

  Result := counters[0];
  for i := 1 to numCores - 1 do
    Inc(Result, counters[i]);
end;

function GetWorker(index: Integer; const oddsArr: TArray<Integer>; workerCount: integer): TProc;
var
  min, max: Integer;
begin
  min := MaxArr div workerCount * index;
  if index + 1 < workerCount then
    max := MaxArr div workerCount * (index + 1) - 1
  else
    max := MaxArr - 1;
  Result :=
    procedure
    var
      i: Integer;
      odds: Integer;
    begin
      odds := 0;
      for i := min to max do
        if Odd(ArrXY[i]) then
          Inc(odds);
      oddsArr[index] := odds;
    end;
end;

function CountParallelXE7Tasks: integer;
var
  i: Integer;
  oddsArr: TArray<Integer>;
  workers: TArray<ITask>;
  workerCount: integer;
begin
  workerCount := Environment.Process.Affinity.Count;
  odds := 0;
  Ticks := TThread.GetTickCount;
  SetLength(oddsArr, workerCount);
  SetLength(workers, workerCount);

  for i := 0 to workerCount-1 do
    workers[i] := TTask.Run(GetWorker(i, oddsArr, workerCount));
  TTask.WaitForAll(workers);

  for i := 0 to workerCount-1 do
    Inc(odds, oddsArr[i]);
  Result := odds;
end;

function CountParallelXE7For: integer;
var
  odds: integer;
begin
  odds := 0;
  TParallel.For(0,  MaxArr-1, procedure(I:Integer)
  begin
    if Odd(ArrXY[i]) then
      TInterlocked.Increment(odds);
  end);
  Result := odds;
end;

procedure Count(const name: string; func: TFunc<integer>);
var
  time: int64;
  cnt: integer;
begin
  time := DSiTimeGetTime64;
  cnt := func();
  time := DSiElapsedTime64(time);
  Writeln(name, ': ', cnt, ' odd elements found in ', time, ' ms');
end;

begin
  try
    FillArray;

    Count('Serial', CountSerial);
    Count('Parallel (OTL)', CountParallelOTL);
    Count('Parallel (XE7 tasks)', CountParallelXE7Tasks);
    Count('Parallel (XE7 for)', CountParallelXE7For);

    Readln;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.
like image 21
gabr Avatar answered Nov 15 '22 21:11

gabr