I've encountered a mysterious performance issue with NumPy matrix-vector multiplication.
I wrote the following snippet to test the speed of matrix-vector multiplication:
```python
import timeit

for i in range(90, 101):
    tm = timeit.repeat('np.matmul(a, b)', number=10000,
        setup='import numpy as np; '
              'a, b = np.random.rand({0},{0}), np.random.rand({0})'.format(i))
    print(i, sum(tm) / 5)
```
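A variant of the snippet that pins the BLAS backend to a single thread can help separate threading effects from cache effects on different machines. This is a sketch that assumes the NumPy installation is linked against OpenBLAS (the default on Arch Linux); an MKL-based build would use `MKL_NUM_THREADS` instead.

```python
import os
# OpenBLAS reads this variable at load time, so it must be set
# before NumPy is imported anywhere in the process.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import timeit

# Time only the sizes around the reported cliff.
for i in (95, 96, 97):
    tm = timeit.repeat(
        'np.matmul(a, b)', number=10000,
        setup='import numpy as np; '
              'a, b = np.random.rand({0},{0}), np.random.rand({0})'.format(i))
    # The minimum over repeats is less noisy than the mean.
    print(i, min(tm))
```

If the cliff at size 96 disappears with a single thread, the slowdown is more likely a threading heuristic in the BLAS library than a cache effect.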
On some machines, the result is normal:
90 0.08936462279998522
91 0.08872119059979014
92 0.09083068459967762
93 0.09311594780047017
94 0.09907015420012613
95 0.10136517100036144
96 0.10339414420013782
97 0.10627872140012187
98 0.1102267580001353
99 0.11277738099979615
100 0.11471197419996315
On some machines, the multiplication slows down at size 96:
90 0.03618830284103751
91 0.03737151022069156
92 0.03295294055715203
93 0.02851409767754376
94 0.02677299762144685
95 0.028137388220056892
96 0.1916038074065
97 0.16719966367818415
98 0.18511182265356182
99 0.1806833743583411
100 0.17172936061397195
Some even slow down by a factor of 1000:
90 0.04183819475583732
91 0.029678784403949977
92 0.02486871089786291
93 0.02882006801664829
94 0.028613184532150625
95 0.02956576123833656
96 31.16711748293601
97 27.803299666382372
98 31.368976181373
99 27.71114011341706
100 26.219610543036833
The Python / NumPy versions are the same on all the machines I tested (3.7.2 / 1.16.2), and the OS is also the same (Arch Linux).
What is the possible reason for this, and why does it occur at size 96?
At size 96 your test hits some software/hardware limit: 96 × 96 × 96 = 884,736, which is close to 1M; multiplied by 8 bytes per double-precision float, that gives 7,077,888 bytes. My iMac has an Intel i5 processor with a 6 MB L3 cache and shows this slowdown at size 96, while the Intel® Core™ i5-7200U Processor, which has a 3 MB L3 cache, does not have this problem. So it could be that the software's algorithm does not work correctly with a 6 MB cache size.
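The arithmetic above can be checked directly, and comparing the BLAS backends across machines is also worthwhile. This is a quick sketch; note that the n³-element working set is the answer's own model, not a verified property of the BLAS kernel, and the actual operand arrays are far smaller:

```python
import numpy as np

n = 96
bytes_per_double = 8

# The answer's working-set model: n**3 double-precision values.
working_set = n ** 3 * bytes_per_double
print(f"n = {n}: model working set = {working_set:,} bytes")  # 7,077,888

# The operands actually passed to matmul (matrix, input vector, result vector)
# occupy only n*n + 2*n doubles, which fits easily in any L2 cache.
operands = (n * n + 2 * n) * bytes_per_double
print(f"operands only: {operands:,} bytes")  # 75,264

# Printing which BLAS library NumPy is linked against helps explain why
# identical Python/NumPy versions behave differently across machines.
np.show_config()
```

If the fast and slow machines report different BLAS libraries (e.g. OpenBLAS vs. a reference BLAS), that difference is a more direct suspect than the cache size alone.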