A friend sent me a comparison between a recent version of Delphi and Java (source code available if you want it). Believe it or not (better believe it), Java is now significantly faster than Delphi because the Delphi compiler won't take advantage of modern CPU instructions! A big breakthrough for the 'slow' Java.
My question is: How can we use modern CPU instructions in Delphi WITHOUT resorting to ASM?
The FastCode project was a partial answer to the above question, but it is now abandoned. Is there any other project similar to FastCode?
This is another article showing that Java and C# are indeed MUCH faster than Delphi: http://webandlife.blogspot.com/2011/12/c-performance-vs-delphi-performance.html
JAVA
import java.util.Date;

public class j
{
    public static void xxx(int n, int m)
    {
        double t;
        int i, j;
        double d, r;
        t = 0.0;
        for (j = 1; j <= n; j++)
        {
            t = t / 1000.0;
            for (i = 1; i <= m; i++)
            {
                t = t + i / 999999.0;
                d = t * t + i;
                r = (t + d) / (200000.0 * (i + 1));
                t = t - r;
            }
        }
        System.out.println(t);
    }

    public static void main(String[] args)
    {
        Date t1, t2;
        t1 = new Date();
        xxx(1, 999999999);
        t2 = new Date();
        System.out.println((t2.getTime() - t1.getTime()) / 1000);
        t1 = new Date();
        xxx(1, 999999999);
        t2 = new Date();
        System.out.println((t2.getTime() - t1.getTime()) / 1000);
    }
}
25 sec
DELPHI
program d;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.DateUtils;

var
  t1, t2: TDateTime;

procedure xxx(n: integer; m: integer);
var
  t: double;
  i, j: integer;
  d, r: double;
begin
  t := 0.0;
  for j := 1 to n do
  begin
    t := t / 1000.0;
    for i := 1 to m do
    begin
      t := t + i / 999999.0;
      d := t * t + i;
      r := (t + d) / (200000.0 * (i + 1));
      t := t - r;
    end;
  end;
  writeln(t);
end;

begin
  t1 := Now;
  xxx(1, 999999999);
  t2 := Now;
  writeln(SecondsBetween(t2, t1));
  t1 := Now;
  xxx(1, 999999999);
  t2 := Now;
  writeln(SecondsBetween(t2, t1));
end.
37 sec
It seems Delphi is still at the bottom of the chain: http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
I wonder how Lazarus compares with Delphi from this point of view.
According to your code, what is slow with the 32-bit Delphi compiler is the floating point arithmetic support, which is far from optimized and copies a lot of content to and from the FPU stack.
When it comes to floating point arithmetic, it is not only JITted Java code that will be faster. Even modern JavaScript JIT compilers can do much better than Delphi!
This blog article is just one reference about this, and provides an asm-level explanation of Delphi's floating-point slowness:
But if you use the Delphi compiler targeting the Win64 platform, it will emit SSE2 opcodes instead of x87, and will be much faster - I suspect comparable to the JITted Java executable.
And, with respect to Java, any Delphi executable will use much less memory than the JVM, so on that front Delphi executables are perfectly on track!
If you want your code to be faster, do not resort to asm or low-level optimization tricks; change your algorithm instead. A better algorithm can be orders of magnitude faster than any compilation hint. Dedicated processing can still be achieved with inlined asm opcodes - take a look at this great set of articles for such low-level hacks. But that is not easy to master, and usually proper software profiling and adding some caching is the best way to performance!
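To illustrate the "change your algorithm" advice with a deliberately simple Java toy (the class and method names are mine, not from the benchmark above): replacing an O(n) loop with a closed-form O(1) formula gains more than any instruction-level tuning ever could.

```java
// Toy illustration: an algorithmic change beats low-level tuning.
public class AlgoVsLoop {
    // O(n): one addition per element.
    static long sumLoop(long n) {
        long s = 0;
        for (long i = 1; i <= n; i++) s += i;
        return s;
    }

    // O(1): Gauss's closed-form formula, same result in constant time.
    static long sumFormula(long n) {
        return n * (n + 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(sumLoop(1_000_000) == sumFormula(1_000_000)); // prints true
    }
}
```

No amount of SSE2 code generation makes the loop competitive with the formula once n grows; the same principle applies to the benchmark's inner loop, as a later answer in this thread shows.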
To follow on from Arnaud's point - I actually compiled this in Delphi for x86 and x64.

32-bit compiler:
Unit1.pas.36: t:= t / 1000.0;
0051274D DD45F0 fld qword ptr [ebp-$10]
00512750 D835E4275100 fdiv dword ptr [$005127e4]
00512756 DD5DF0 fstp qword ptr [ebp-$10]
00512759 9B wait
Unit1.pas.37: for i:= 1 to m do
0051275A 8B45F8 mov eax,[ebp-$08]
0051275D 85C0 test eax,eax
0051275F 7E57 jle $005127b8
00512761 8945D0 mov [ebp-$30],eax
00512764 C745EC01000000 mov [ebp-$14],$00000001
Unit1.pas.39: t:= t + i / 999999.0;
0051276B DB45EC fild dword ptr [ebp-$14]
0051276E D835E8275100 fdiv dword ptr [$005127e8]
00512774 DC45F0 fadd qword ptr [ebp-$10]
00512777 DD5DF0 fstp qword ptr [ebp-$10]
0051277A 9B wait
Unit1.pas.40: d:= t * t + i;
0051277B DD45F0 fld qword ptr [ebp-$10]
0051277E DC4DF0 fmul qword ptr [ebp-$10]
00512781 DB45EC fild dword ptr [ebp-$14]
00512784 DEC1 faddp st(1)
00512786 DD5DE0 fstp qword ptr [ebp-$20]
00512789 9B wait
Unit1.pas.41: r:= (t + d) / (200000.0 * (i + 1));
0051278A DD45F0 fld qword ptr [ebp-$10]
0051278D DC45E0 fadd qword ptr [ebp-$20]
00512790 8B45EC mov eax,[ebp-$14]
00512793 40 inc eax
00512794 8945CC mov [ebp-$34],eax
00512797 DB45CC fild dword ptr [ebp-$34]
0051279A D80DEC275100 fmul dword ptr [$005127ec]
005127A0 DEF9 fdivp st(1)
005127A2 DD5DD8 fstp qword ptr [ebp-$28]
005127A5 9B wait
Unit1.pas.42: t:= t - r;
005127A6 DD45F0 fld qword ptr [ebp-$10]
005127A9 DC65D8 fsub qword ptr [ebp-$28]
005127AC DD5DF0 fstp qword ptr [ebp-$10]
005127AF 9B wait
Unit1.pas.43: end;
005127B0 FF45EC inc dword ptr [ebp-$14]
Unit1.pas.37: for i:= 1 to m do
005127B3 FF4DD0 dec dword ptr [ebp-$30]
005127B6 75B3 jnz $0051276b
Unit1.pas.44: end;
005127B8 FF45E8 inc dword ptr [ebp-$18]
64-bit compiler:
Unit1.pas.36: t:= t / 1000.0;
000000000059F94E F20F104548 movsd xmm0,qword ptr [rbp+$48]
000000000059F953 F20F5E05BD000000 divsd xmm0,qword ptr [rel $000000bd]
000000000059F95B F20F114548 movsd qword ptr [rbp+$48],xmm0
000000000059F960 C7C001000000 mov eax,$00000001
000000000059F966 8B5568 mov edx,[rbp+$68]
000000000059F969 894544 mov [rbp+$44],eax
000000000059F96C 395544 cmp [rbp+$44],edx
000000000059F96F 7F73 jnle xxx + $C4
000000000059F971 83C201 add edx,$01
Unit1.pas.39: t:= t + i / 999999.0;
000000000059F974 F20F2A4544 cvtsi2sd xmm0,dword ptr [rbp+$44]
000000000059F979 F20F5E059F000000 divsd xmm0,qword ptr [rel $0000009f]
000000000059F981 F20F104D48 movsd xmm1,qword ptr [rbp+$48]
000000000059F986 F20F58C8 addsd xmm1,xmm0
000000000059F98A F20F114D48 movsd qword ptr [rbp+$48],xmm1
Unit1.pas.40: d:= t * t + i;
000000000059F98F F20F104548 movsd xmm0,qword ptr [rbp+$48]
000000000059F994 F20F594548 mulsd xmm0,qword ptr [rbp+$48]
000000000059F999 F20F2A4D44 cvtsi2sd xmm1,dword ptr [rbp+$44]
000000000059F99E F20F58C1 addsd xmm0,xmm1
000000000059F9A2 F20F114538 movsd qword ptr [rbp+$38],xmm0
Unit1.pas.41: r:= (t + d) / (200000.0 * (i + 1));
000000000059F9A7 F20F104548 movsd xmm0,qword ptr [rbp+$48]
000000000059F9AC F20F584538 addsd xmm0,qword ptr [rbp+$38]
000000000059F9B1 8B4544 mov eax,[rbp+$44]
000000000059F9B4 83C001 add eax,$01
000000000059F9B7 F20F2AC8 cvtsi2sd xmm1,eax
000000000059F9BB F20F590D65000000 mulsd xmm1,qword ptr [rel $00000065]
000000000059F9C3 F20F5EC1 divsd xmm0,xmm1
000000000059F9C7 F20F114530 movsd qword ptr [rbp+$30],xmm0
Unit1.pas.42: t:= t - r;
000000000059F9CC F20F104548 movsd xmm0,qword ptr [rbp+$48]
000000000059F9D1 F20F5C4530 subsd xmm0,qword ptr [rbp+$30]
000000000059F9D6 F20F114548 movsd qword ptr [rbp+$48],xmm0
Unit1.pas.43: end;
000000000059F9DB 83454401 add dword ptr [rbp+$44],$01
000000000059F9DF 395544 cmp [rbp+$44],edx
000000000059F9E2 7590 jnz xxx + $54
000000000059F9E4 90 nop
Unit1.pas.44: end;
000000000059F9E5 83454001 add dword ptr [rbp+$40],$01
000000000059F9E9 394D40 cmp [rbp+$40],ecx
000000000059F9EC 0F855CFFFFFF jnz xxx + $2E
000000000059F9F2 90 nop
Unit1.pas.45: writeln(t);
000000000059F9F3 488B0D9E150300 mov rcx,[rel $0003159e]
Oddly, in this case, the x87 FPU code was actually about ~5% faster. The conclusion is probably nothing more than that Delphi's 32-bit/x87 compiler is extremely mature and rather well optimized, while the 64-bit compiler still has some room for performance refinement. I can easily see a few places where the SSE code could be optimized here: i, for example, could be stored in an XMM register and re-used rather than being re-converted each time with cvtsi2sd, and d could be held in an XMM register for the next calculation rather than being stored and re-loaded, etc.

Unaligned MOVs in and out of XMM registers can actually be surprisingly expensive. The actual SSE calculations are faster, but the excessive data movement probably evens out the score. Maybe Java forces 16-byte alignment on the stack? I know macOS does this, and there are definite benefits for SSE in using aligned rather than unaligned moves (at the expense of consuming more stack space, of course).
For example:

fild        : 1 op, 9 latency  (x87)
cvtsi2sd    : 2 op, 12 latency (SSE)

or

fld         : 1 op, 4 latency (x87)
movsd [r,m] : 2 op, 4 latency (SSE)

Delphi's compiler, while emitting SSE instructions, still seems to be treating the workflow in a similar way to the way it would with the x87 unit, which isn't necessarily the best way. In either case, David is correct - the compiler is what it is. You can't do anything to change it.
Where I need fast mathematical routines, I still code them in ASM myself - this will generally be superior to anything any compiler can do, because you can customize the behaviour to the exact calculations you are doing. I have old legacy 32-bit applications with hand-tuned SSE3 ASM algorithms for complex number arithmetic and matrix operations. The key is that you don't need to optimize everything - you only need to optimize the bottlenecks. This is a rather important point to keep in mind.
I'm going to answer the meta question here: "why can't the Delphi compiler use more modern CPU instructions, and why can Java?"
Basically, there are two ways to compile code:

1. Pre-compiled (ahead of time), before distribution to the target machines.
2. Post-compiled (JIT/interpreted), on the target machine itself.

Examples of 1. include Delphi, C/C++, etc.
Examples of 2. include Java, .NET, JavaScript, etc.
Pre-compiled environments
Pre-compiled environments have you compile your code once and run it on your target computers. Compiled programs cannot run on machines with older instruction sets than the program was compiled for. The minimum requirement is therefore the lowest of what the compiler can target and the lowest architecture among all your target machines. If you don't know your target machines, it is limited by the compiler.
Post-compiled environments
Post-compiled environments compile on the target machine. You don't have to know in advance what architecture your code will run on: the compiler running there knows what the machine supports and can take best advantage of it. The minimum requirement is the lowest of what the compiler can do and the architecture of that one target machine.
The reason is that in a post-compiled, JITed or interpreted language environment, the compiler actually runs on the target machine, which means it can use all features of that target architecture. It could even take into account aspects like physical memory, cache sizes or disk speed, and measure current run-time performance to do post-compile optimizations on running code.
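A minimal Java sketch of this effect (the harness, names and iteration counts are mine, not from the thread): the first timed call can include interpretation and JIT compilation, while later calls run the already-optimized machine code. This is presumably why the benchmark above calls xxx twice and prints both timings.

```java
// Illustrative JIT warm-up harness (not a rigorous benchmark).
public class Warmup {
    // Hot floating-point loop, similar in spirit to the thread's xxx.
    static double work(int m) {
        double t = 0.0;
        for (int i = 1; i <= m; i++) t += i / 999999.0;
        return t;
    }

    public static void main(String[] args) {
        // First run: may include interpretation + JIT compilation.
        long t0 = System.nanoTime();
        work(50_000_000);
        long first = System.nanoTime() - t0;

        // Second run: executes the compiled, optimized code.
        t0 = System.nanoTime();
        work(50_000_000);
        long second = System.nanoTime() - t0;

        // Timings vary by machine; typically the second run is faster.
        System.out.println("first=" + first + "ns, second=" + second + "ns");
    }
}
```

For serious measurements, a dedicated harness with proper warm-up iterations is preferable to hand-rolled timing like this.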
Delphi and other tools
As for the Windows 32-bit Delphi compiler, I think the minimum requirement is still the 486 or the Pentium (given the Pentium-safe FDIV option). It uses x87 for floating-point code because of that.
The Windows 64-bit Delphi compiler has a minimum requirement of SSE instructions, which it uses for floating-point code.
I've not yet checked the minimum requirements of the other compiler platforms.
The minimum Delphi requirements have to do with the emphasis on backward compatibility.
Some other environments (most C/C++ compilers, and likely others too) allow you to specify the minimum instruction set. Delphi doesn't, and I think the main reason is the complexity of development and testing: the matrix (if it is indeed a 2-dimensional problem) of possibilities gets large really quickly.
JIT compilers usually do not fully support the latest hardware architecture benefits across the board, as it is very expensive to do so.
JIT compilers often do support optimizations for certain processor families (for instance when copying regions of memory).
I know that Java and .NET have made quite a few advances in this area over the last decade. There is a really good 2005 article on .NET JIT usage of CPU features.
Background.
The question is how to optimize Delphi code to make it comparable to Java, and to do so without resorting to asm coding.
Analysis.
In the given example, the algorithm uses floating point calculations. The performance and the weak points of the compiler have been investigated in other answers. Theoretically the x64 compiler could perform better, since the SSE2 opcodes and registers offer more room for optimization. So the compiler would be the bottleneck here.
It was also suggested that a better algorithm could improve the performance. Let's look at this a little bit more.
Improving algorithm.
In the algorithm loop, the loop index i is used three times as a variable in the calculations. Since this forces an integer-to-float conversion each time (upon loading into an FPU or SSE2 register), it has a big impact on performance. Let's investigate whether we can help the compiler optimize away those conversions.
procedure xxx(n: integer; m: integer);
var
  t, ii: double;
  i, j: integer;
  d, r: double;
begin
  t := 0.0;
  for j := 1 to n do
  begin
    t := t / 1000.0;
    ii := 1.0;
    for i := 1 to m do
    begin
      t := t + ii / 999999.0;
      d := t * t + ii;
      ii := ii + 1.0;
      r := (t + d) / (200000.0 * ii);
      t := t - r;
    end;
  end;
  writeln(t);
end;
Now we have a clean algorithm using only float values. For reference, here is the Java code:
public static void xxy(int n, int m)
{
    double t;
    int i, j;
    double d, r, ii;
    t = 0.0;
    for (j = 1; j <= n; j++)
    {
        t = t / 1000.0;
        ii = 1.0;
        for (i = 1; i <= m; i++)
        {
            t = t + ii / 999999.0;
            d = t * t + ii;
            ii = ii + 1.0;
            r = (t + d) / (200000.0 * ii);
            t = t - r;
        }
    }
    System.out.println(t);
}
Benchmark.
Using the XE2 compiler.

                     x32       x64       Java (x64)
---------------------------------------------------
Original algorithm   23417ms   22293ms   22045ms
Updated algorithm    22362ms   14059ms   15507ms
The disassembly for the x64 code looks like this:
Project19.dpr.11: begin
000000000046ABC0 55 push rbp
000000000046ABC1 4883EC20 sub rsp,$20
000000000046ABC5 488BEC mov rbp,rsp
Project19.dpr.12: t:= 0.0;
000000000046ABC8 F20F1005B0000000 movsd xmm0,qword ptr [rel $000000b0]
000000000046ABD0 C7C001000000 mov eax,$00000001
000000000046ABD6 4189C8 mov r8d,ecx
000000000046ABD9 89C1 mov ecx,eax
000000000046ABDB 413BC8 cmp ecx,r8d
000000000046ABDE 7F7B jnle xxx + $9B
000000000046ABE0 4183C001 add r8d,$01
Project19.dpr.15: t:= t / 1000.0;
000000000046ABE4 F20F5E059C000000 divsd xmm0,qword ptr [rel $0000009c]
Project19.dpr.16: ii := 1.0;
000000000046ABEC F20F100D9C000000 movsd xmm1,qword ptr [rel $0000009c]
000000000046ABF4 C7C001000000 mov eax,$00000001
000000000046ABFA 4189D1 mov r9d,edx
000000000046ABFD 413BC1 cmp eax,r9d
000000000046AC00 7F50 jnle xxx + $92
000000000046AC02 4183C101 add r9d,$01
Project19.dpr.19: t:= t + ii / 999999.0;
000000000046AC06 660F28D1 movapd xmm2,xmm1
000000000046AC0A F20F5E1586000000 divsd xmm2,qword ptr [rel $00000086]
000000000046AC12 F20F58C2 addsd xmm0,xmm2
Project19.dpr.20: d:= t * t + ii;
000000000046AC16 660F28D0 movapd xmm2,xmm0
000000000046AC1A F20F59D0 mulsd xmm2,xmm0
000000000046AC1E F20F58D1 addsd xmm2,xmm1
Project19.dpr.21: ii := ii + 1.0;
000000000046AC22 F20F580D66000000 addsd xmm1,qword ptr [rel $00000066]
Project19.dpr.22: r:= (t + d) / (200000.0 * ii);
000000000046AC2A 660F28D8 movapd xmm3,xmm0
000000000046AC2E F20F58DA addsd xmm3,xmm2
000000000046AC32 660F28D1 movapd xmm2,xmm1
000000000046AC36 F20F591562000000 mulsd xmm2,qword ptr [rel $00000062]
000000000046AC3E F20F5EDA divsd xmm3,xmm2
000000000046AC42 660F29DA movapd xmm2,xmm3
Project19.dpr.23: t:= t - r;
000000000046AC46 F20F5CC2 subsd xmm0,xmm2
Project19.dpr.24: end;
000000000046AC4A 83C001 add eax,$01
000000000046AC4D 413BC1 cmp eax,r9d
000000000046AC50 75B4 jnz xxx + $46
000000000046AC52 90 nop
Project19.dpr.25: end;
000000000046AC53 83C101 add ecx,$01
000000000046AC56 413BC8 cmp ecx,r8d
000000000046AC59 7589 jnz xxx + $24
000000000046AC5B 90 nop
Project19.dpr.26: WriteLn(t);
000000000046AC5C 488B0DC5100100 mov rcx,[rel $000110c5]
000000000046AC63 660F29C1 movapd xmm1,xmm0
000000000046AC67 E874D7F9FF call @Write0Ext
000000000046AC6C 4889C1 mov rcx,rax
000000000046AC6F E88CD7F9FF call @WriteLn
000000000046AC74 E877AFF9FF call @_IOTest
Project19.dpr.27: end;
000000000046AC79 488D6520 lea rsp,[rbp+$20]
The extra integer to float conversions are gone and the registers are better used.
Extra optimization
For the x32 compiler, replacing the divisions by 999999.0 and 200000.0 with multiplications by reciprocal constants (const cA: Double = 1.0/999999.0; cB: Double = 1.0/200000.0;) will also improve performance.
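The same reciprocal trick can be sketched in Java (class and method names are mine, purely illustrative): division by a constant is replaced with multiplication by its precomputed reciprocal, which most CPUs execute far faster. Because 1.0/999999.0 is itself rounded, results can differ in the last few bits, so verify the drift is acceptable for your use case.

```java
// Sketch of the reciprocal-constant optimization.
public class Reciprocal {
    // Precomputed once; a rounded approximation of the exact reciprocal,
    // so results may differ from true division in the last bits.
    static final double INV_999999 = 1.0 / 999999.0;

    static double withDivision(int m) {
        double t = 0.0;
        for (int i = 1; i <= m; i++)
            t += i / 999999.0;          // one division per iteration
        return t;
    }

    static double withReciprocal(int m) {
        double t = 0.0;
        for (int i = 1; i <= m; i++)
            t += i * INV_999999;        // multiplication is much cheaper
        return t;
    }

    public static void main(String[] args) {
        double a = withDivision(1000);
        double b = withReciprocal(1000);
        System.out.println(Math.abs(a - b) < 1e-9); // same value up to rounding noise
    }
}
```

Compilers will not perform this transformation automatically at default settings precisely because it changes the rounded result, which is why it has to be done by hand in the source.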