Preamble: I was watching the first class of this MIT course on youtube and around the time 22:20 the teacher presents the results of a performance comparison of the same algorithm in Java and C. He said that the C implementation is 2x faster than the Java implementation, so I tried it at home to see with my own eyes, but the fact is that the Java implementation turned out to be faster than the C one, and I'm seeking for help to understand why.
The problem implementations
I have these two implementations of the same algorithm:
C
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#define n 1024
double A[n][n];
double B[n][n];
double C[n][n];
float tdiff(struct timeval *start,
struct timeval *end)
{
return (end->tv_sec - start->tv_sec) +
1e-6 * (end->tv_usec - start->tv_usec);
}
int main(int argc, const char *argv[])
{
for (int i = 0; i < n; ++i)
{
for (int j = 0; j < n; ++j)
{
A[i][j] = (double)rand() / (double)RAND_MAX;
B[i][j] = (double)rand() / (double)RAND_MAX;
C[i][j] = 0;
}
}
struct timeval start, end;
gettimeofday(&start, NULL);
for (int i = 0; i < n; ++i)
{
for (int j = 0; j < n; ++j)
{
for (int k = 0; k < n; ++k)
{
C[i][j] += A[i][k] * B[k][j];
}
}
}
gettimeofday(&end, NULL);
printf("%0.6f\n", tdiff(&start, &end));
double d = 0;
for (int i = 0; i < n; ++i)
{
for (int j = 0; j < n; ++j)
{
d += C[i][j];
}
}
printf("Fim %0.6f\n", d);
return 0;
}
Java
import java.util.Random;
public class TesteMatrixMultiply{
public final static int n = 1024;
static double[][] A = new double[n][n];
static double[][] B = new double[n][n];
static double[][] C = new double[n][n];
public static void main (String args[])
{
Random r = new Random();
for (int i = 0; i < n; ++i)
{
for (int j = 0; j < n; ++j)
{
A[i][j] = r.nextDouble();
B[i][j] = r.nextDouble();
C[i][j] = 0;
}
}
long start = System.nanoTime();
for (int i = 0; i < n; ++i)
{
for (int j = 0; j < n; ++j)
{
for (int k = 0; k < n; ++k)
{
C[i][j] += A[i][k] * B[k][j];
}
}
}
long stop = System.nanoTime();
double tdiff = (stop - start) * 1e-9;
System.out.println(tdiff);
double d = 0;
for (int i = 0; i < n; ++i)
{
for (int j = 0; j < n; ++j)
{
d+= C[i][j];
}
}
System.out.printf("Fim %.6f\n", d);
}
}
I Run the two implementations using the Linux perf tool, so I could collect some info.
C execution
16.867058
Fim 268364458.846206
Performance counter stats for './teste-matrix-multiply':
16896,010532 task-clock (msec) # 1,000 CPUs utilized
76 context-switches # 0,004 K/sec
2 cpu-migrations # 0,000 K/sec
6.196 page-faults # 0,367 K/sec
38.943.668.123 cycles # 2,305 GHz (49,98%)
27.050.004.884 instructions # 0,69 insn per cycle (62,51%)
2.194.116.427 branches # 129,860 M/sec (62,51%)
1.323.195 branch-misses # 0,06% of all branches (62,50%)
11.889.108.593 L1-dcache-loads # 703,664 M/sec (62,51%)
1.089.638.099 L1-dcache-load-misses # 9,17% of all L1-dcache hits (62,52%)
1.082.875.343 LLC-loads # 64,091 M/sec (49,99%)
835.764.813 LLC-load-misses # 77,18% of all LL-cache hits (49,99%)
16,898358467 seconds time elapsed
Java execution
6.551244991000001
Fim 268384388.815752
Performance counter stats for 'java TesteMatrixMultiply':
6754,506236 task-clock (msec) # 1,013 CPUs utilized
581 context-switches # 0,086 K/sec
27 cpu-migrations # 0,004 K/sec
12.938 page-faults # 0,002 M/sec
23.531.901.353 cycles # 3,484 GHz (49,66%)
19.009.664.164 instructions # 0,81 insn per cycle (62,09%)
3.342.430.354 branches # 494,845 M/sec (62,07%)
3.524.682 branch-misses # 0,11% of all branches (62,32%)
7.752.842.493 L1-dcache-loads # 1147,803 M/sec (62,73%)
2.676.041.100 L1-dcache-load-misses # 34,52% of all L1-dcache hits (62,88%)
1.067.905.566 LLC-loads # 158,103 M/sec (50,34%)
55.152.268 LLC-load-misses # 5,16% of all LL-cache hits (50,00%)
6,669662205 seconds time elapsed
Compilations
The Java implementation was compiled with:
library: zulu11.35.13-ca-jdk11.0.5-linux_x64
call: javac TesteMatrixMultiply.java
The C implementation was compiled with:
library: clang+llvm-10.0.0-x86_64-linux-gnu-ubuntu-18.04
call: clang teste-matrix-multiply.c -o teste-matrix-multiply-clang
My CPU
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 142
Model name: Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
Stepping: 9
CPU MHz: 2100.157
CPU max MHz: 3500,0000
CPU min MHz: 400,0000
BogoMIPS: 5808.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 4096K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
It's clear that the C execution suffers a lot from cache misses, but I can't understand why Java execution doesn't have the same problem.
Update 1
I really didn't know about the -O3 option before,Now I recompiled using it and the performance has been improved a lot, but the cache misses are still high.
7.736069
Fim 268364458.846206
Performance counter stats for './teste-matrix-multiply':
7801,043965 task-clock (msec) # 1,000 CPUs utilized
10 context-switches # 0,001 K/sec
0 cpu-migrations # 0,000 K/sec
6.199 page-faults # 0,795 K/sec
13.733.804.981 cycles # 1,761 GHz (49,96%)
5.545.160.735 instructions # 0,40 insn per cycle (62,47%)
578.468.520 branches # 74,153 M/sec (62,47%)
1.231.327 branch-misses # 0,21% of all branches (62,47%)
2.193.986.457 L1-dcache-loads # 281,243 M/sec (62,53%)
1.132.220.799 L1-dcache-load-misses # 51,61% of all L1-dcache hits (62,55%)
1.158.935.692 LLC-loads # 148,562 M/sec (50,04%)
683.217.415 LLC-load-misses # 58,95% of all LL-cache hits (49,98%)
7,801911690 seconds time elapsed
It is still about 1sec behind Java, even though we are not considering the warmup of the JVM.
I know that it depends on my specific hardware. But I'd like to be able to explain why I'm getting these numbers. What tools, or diagnostic commands I should look at? Imagine that it is not my notebook, it is a production server and I should be able to understand what is happening. What should I do? Optimize the code is not a option on this exercise.
I'm unable to reproduce your results. On my system, I get results similar to the video.
Are you sure you're compiling the C code with optimization (e.g. -O3)?
Java output:
17.902893012
Fim 268642903.459048
Without optimization, the C output (from gcc) is:
22.350832
Fim 268364458.846206
The C code output (with -O3):
5.077404
Fim 268364458.846206
However, clang on my system is a bit slower than gcc (~2x slower) ...
Without optimization (from clang):
19.541418
Fim 268364458.846206
With -O3 (from clang):
10.505201
Fim 268364458.846206
My system is linux, fedora 29 (uname -a):
Linux myhost 5.3.11-100.fc29.x86_64 #1 SMP Tue Nov 12 20:41:25 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
gcc version is:
gcc (GCC) 8.3.1 20190223 (Red Hat 8.3.1-2)
clang version is:
clang version 7.0.1 (Fedora 7.0.1-6.fc29)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
javac version is:
javac 1.8.0_232
java version is:
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
I'm on a 4 core linux machine with hyperthreading, so 8 effective cores. Here is the first part of my /proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
stepping : 5
microcode : 0x1d
cpu MHz : 2786.179
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid dtherm ida flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5307.00
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With