I am trying out the Parallel Programming features of Delphi XE7 Update 1.
I created a simple TParallel.For
loop that basically does some bogus operations to pass the time.
I launched the program on a 36 vCPU at an AWS instance (c4.8xlarge) to try to see what the gain of Parallel Programming could be.
When I first launch the program and execute the TParallel.For
loop, I see a significant gain (although admitelly a lot less than I anticipated with 36 vCPUs):
Parallel matches: 23077072 in 242ms
Single Threaded matches: 23077072 in 2314ms
If I do not close the program and run the pass again on the 36 vCPU machine shortly after (for example, immediately or some 10-20 seconds later), the Parallel pass worsens a lot:
Parallel matches: 23077169 in 2322ms
Single Threaded matches: 23077169 in 2316ms
If I don't close the program and I wait a few minutes (not a few seconds, but a few minutes) before running the pass again, I get again the results I get when first launching the program (10x improvement in response time).
The very first pass right after launching the program is always faster on the 36 vCPUs machine, so it seems that this effect only happens the second time a TParallel.For
is called in the program.
This is the sample code I'm running:
unit ParallelTests;
interface
uses
Winapi.Windows, Winapi.Messages, System.SysUtils, System.Variants, System.Classes, Vcl.Graphics,
System.Threading, System.SyncObjs, System.Diagnostics,
Vcl.Controls, Vcl.Forms, Vcl.Dialogs, Vcl.StdCtrls;
type
TForm1 = class(TForm)
Button1: TButton;
Memo1: TMemo;
SingleThreadCheckBox: TCheckBox;
ParallelCheckBox: TCheckBox;
UnitsEdit: TEdit;
Label1: TLabel;
procedure Button1Click(Sender: TObject);
private
{ Private declarations }
public
{ Public declarations }
end;
var
Form1: TForm1;
implementation
{$R *.dfm}
procedure TForm1.Button1Click(Sender: TObject);
var
matches: integer;
i,j: integer;
sw: TStopWatch;
maxItems: integer;
referenceStr: string;
begin
sw := TStopWatch.Create;
maxItems := 5000;
Randomize;
SetLength(referenceStr,120000); for i := 1 to 120000 do referenceStr[i] := Chr(Ord('a') + Random(26));
if ParallelCheckBox.Checked then begin
matches := 0;
sw.Reset;
sw.Start;
TParallel.For(1, MaxItems,
procedure (Value: Integer)
var
index: integer;
found: integer;
begin
found := 0;
for index := 1 to length(referenceStr) do begin
if (((Value mod 26) + ord('a')) = ord(referenceStr[index])) then begin
inc(found);
end;
end;
TInterlocked.Add(matches, found);
end);
sw.Stop;
Memo1.Lines.Add('Parallel matches: ' + IntToStr(matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
end;
if SingleThreadCheckBox.Checked then begin
matches := 0;
sw.Reset;
sw.Start;
for i := 1 to MaxItems do begin
for j := 1 to length(referenceStr) do begin
if (((i mod 26) + ord('a')) = ord(referenceStr[j])) then begin
inc(matches);
end;
end;
end;
sw.Stop;
Memo1.Lines.Add('Single Threaded matches: ' + IntToStr(Matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
end;
end;
end.
Is this working as designed? I found this article (http://delphiaball.co.uk/tag/parallel-programming/) recommending that I let the library decide the thread pool, but I do not see the point of using Parallel Programming if I have to wait minutes from request to request so that the request is served faster.
Am I missing anything on how a TParallel.For
loop is supposed to be used?
Please note that I cannot reproduce this on a AWS m3.large instance (2 vCPU according to AWS). In that instance, I always get a slight improvement, and I do not get a worse result in subsequent calls of TParallel.For
shortly after.
Parallel matches: 23077054 in 2057ms
Single Threaded matches: 23077054 in 2900ms
So it seems that this effect occurs when there are many cores available (36), which is a pity because the whole point of Parallel Programming is to benefit from many cores. I wonder if this is a library bug because of the high count of cores or the fact that the core count is not a power of 2 in this case.
UPDATE: After testing it with various instances of different vCPU counts in AWS, this seems to be the behaviour:
- 36 vCPUs (c4.8xlarge). You have to wait minutes between subsequent calls to a vanilla TParallel call (it makes it unusable for production)
- 32 vCPUs (c3.8xlarge). You have to wait minutes between subsequent calls to a vanilla TParallel call (it makes it unusable for production)
- 16 vCPUs (c3.4xlarge). You have to wait sub second times. It could be usable if load is low but response time still important
- 8 vCPUs (c3.2xlarge). It seems to work normally
- 4 vCPUs (c3.xlarge). It seems to work normally
- 2 vCPUs (m3.large). It seems to work normally
I created two test programs, based on yours, to compare System.Threading
and OTL
. I built with XE7 update 1, and OTL r1397. The OTL source that I used corresponds to release 3.04. I built with the 32 bit Windows compiler, using release build options.
My test machine is a dual Intel Xeon E5530 running Windows 7 x64. The system has two quad core processors. That's 8 processors in total, but the system says there are 16 due to hyper-threading. Experience tells me that hyper-threading is just marketing guff and I've never seen scaling beyond a factor of 8 on this machine.
Now for the two programs, which are almost identical.
System.Threading
program SystemThreadingTest;
{$APPTYPE CONSOLE}
uses
System.Diagnostics,
System.Threading;
const
maxItems = 5000;
DataSize = 100000;
procedure DoTest;
var
matches: integer;
i, j: integer;
sw: TStopWatch;
referenceStr: string;
begin
Randomize;
SetLength(referenceStr, DataSize);
for i := low(referenceStr) to high(referenceStr) do
referenceStr[i] := Chr(Ord('a') + Random(26));
// parallel
matches := 0;
sw := TStopWatch.StartNew;
TParallel.For(1, maxItems,
procedure(Value: integer)
var
index: integer;
found: integer;
begin
found := 0;
for index := low(referenceStr) to high(referenceStr) do
if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
inc(found);
AtomicIncrement(matches, found);
end);
Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
// serial
matches := 0;
sw := TStopWatch.StartNew;
for i := 1 to maxItems do
for j := low(referenceStr) to high(referenceStr) do
if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
inc(matches);
Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;
begin
while True do
DoTest;
end.
OTL
program OTLTest;
{$APPTYPE CONSOLE}
uses
Winapi.Windows,
Winapi.Messages,
System.Diagnostics,
OtlParallel;
const
maxItems = 5000;
DataSize = 100000;
procedure ProcessThreadMessages;
var
msg: TMsg;
begin
while PeekMessage(Msg, 0, 0, 0, PM_REMOVE) and (Msg.Message <> WM_QUIT) do begin
TranslateMessage(Msg);
DispatchMessage(Msg);
end;
end;
procedure DoTest;
var
matches: integer;
i, j: integer;
sw: TStopWatch;
referenceStr: string;
begin
Randomize;
SetLength(referenceStr, DataSize);
for i := low(referenceStr) to high(referenceStr) do
referenceStr[i] := Chr(Ord('a') + Random(26));
// parallel
matches := 0;
sw := TStopWatch.StartNew;
Parallel.For(1, maxItems).Execute(
procedure(Value: integer)
var
index: integer;
found: integer;
begin
found := 0;
for index := low(referenceStr) to high(referenceStr) do
if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
inc(found);
AtomicIncrement(matches, found);
end);
Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
ProcessThreadMessages;
// serial
matches := 0;
sw := TStopWatch.StartNew;
for i := 1 to maxItems do
for j := low(referenceStr) to high(referenceStr) do
if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
inc(matches);
Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;
begin
while True do
DoTest;
end.
And now the output.
System.Threading output
Parallel matches: 19230817 in 374ms Serial matches: 19230817 in 2423ms Parallel matches: 19230698 in 374ms Serial matches: 19230698 in 2409ms Parallel matches: 19230556 in 368ms Serial matches: 19230556 in 2433ms Parallel matches: 19230635 in 2412ms Serial matches: 19230635 in 2430ms Parallel matches: 19230843 in 2441ms Serial matches: 19230843 in 2413ms Parallel matches: 19230905 in 2493ms Serial matches: 19230905 in 2423ms Parallel matches: 19231032 in 2430ms Serial matches: 19231032 in 2443ms Parallel matches: 19230669 in 2440ms Serial matches: 19230669 in 2473ms Parallel matches: 19230811 in 2404ms Serial matches: 19230811 in 2432ms ....
OTL output
Parallel matches: 19230667 in 422ms Serial matches: 19230667 in 2475ms Parallel matches: 19230663 in 335ms Serial matches: 19230663 in 2438ms Parallel matches: 19230889 in 395ms Serial matches: 19230889 in 2461ms Parallel matches: 19230874 in 391ms Serial matches: 19230874 in 2441ms Parallel matches: 19230617 in 385ms Serial matches: 19230617 in 2524ms Parallel matches: 19231021 in 368ms Serial matches: 19231021 in 2455ms Parallel matches: 19230904 in 357ms Serial matches: 19230904 in 2537ms Parallel matches: 19230568 in 373ms Serial matches: 19230568 in 2456ms Parallel matches: 19230758 in 333ms Serial matches: 19230758 in 2710ms Parallel matches: 19230580 in 371ms Serial matches: 19230580 in 2532ms Parallel matches: 19230534 in 336ms Serial matches: 19230534 in 2436ms Parallel matches: 19230879 in 368ms Serial matches: 19230879 in 2419ms Parallel matches: 19230651 in 409ms Serial matches: 19230651 in 2598ms Parallel matches: 19230461 in 357ms ....
I left the OTL version running for a long time and the pattern never changed. The parallel version was always around 7 times faster than the serial.
Conclusion
The code is astonishingly simple. The only reasonable conclusion that can be drawn is that the implementation of System.Threading
is defective.
There have been numerous bug reports relating to the new System.Threading
library. All the signs are that its quality is poor. Embarcadero have a long track record of releasing sub-standard library code. I'm thinking of TMonitor
, the XE3 string helper, earlier versions of System.IOUtils
, FireMonkey. The list goes on.
It seems clear that quality is a big problem with Embarcadero. Code is released that quite clearly has not been tested adequately, if at all. This is especially troublesome for a threading library where bugs can lie dormant and only be exposed in specific hardware/software configurations. The experience from TMonitor
leads me to believe that Embarcadero do not have sufficient expertise to produce high quality, correct, threading code.
My advice is that you should not use System.Threading
in its current form. Until such a time as it can be seen to have sufficient quality and correctness, it should be shunned. I suggest that you use OTL.
EDIT: Original OTL version of the program had a live memory leak which occurred because of an ugly implementation detail. Parallel.For creates tasks with the .Unobserved modifier. That causes said tasks to only be destroyed when some internal message window receives a 'task has terminated' message. This window is created in the same thread as the Parallel.For caller - i.e. in the main thread in this case. As the main thread was not processing messages, tasks were never destroyed and memory consumption (plus other resources) just piled up. It is possible that because of that program hanged after some time.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With