Doing something like
```python
import numpy as np

a = np.random.rand(10**4, 10**4)
b = np.dot(a, a)
```
uses multiple cores, and it runs nicely.
The elements in `a`, though, are 64-bit floats (or 32-bit on 32-bit platforms?), and I'd like to multiply 8-bit integer arrays. Trying the following, though:
```python
a = np.random.randint(2, size=(n, n)).astype(np.int8)
```
results in the dot product not using multiple cores, and thus running ~1000x slower on my PC.
array: `np.random.randint(2, size=shape).astype(dtype)`

| dtype   | shape          | %time (average)          |
|---------|----------------|--------------------------|
| float32 | (2000, 2000)   | 62.5 ms                  |
| float32 | (3000, 3000)   | 219 ms                   |
| float32 | (4000, 4000)   | 328 ms                   |
| float32 | (10000, 10000) | 4.09 s                   |
| int8    | (2000, 2000)   | 13 seconds               |
| int8    | (3000, 3000)   | 3min 26s                 |
| int8    | (4000, 4000)   | 12min 20s                |
| int8    | (10000, 10000) | Didn't finish in 6 hours |
| float16 | (2000, 2000)   | 2min 25s                 |
| float16 | (3000, 3000)   | Not tested               |
| float16 | (4000, 4000)   | Not tested               |
| float16 | (10000, 10000) | Not tested               |
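For reference, a rough sketch of how one row of this table can be reproduced in a plain script (the exact numbers will of course vary with the machine and the BLAS build):

```python
import time
import numpy as np

# Reproduce one row of the table above.
shape, dtype = (2000, 2000), np.float32
a = np.random.randint(2, size=shape).astype(dtype)

start = time.time()
b = np.dot(a, a)
print('{} {}: {:.1f} ms'.format(dtype.__name__, shape, (time.time() - start) * 1e3))
```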
I understand NumPy uses BLAS, which doesn't support integers, but if I use the SciPy BLAS wrappers, i.e.

```python
import scipy.linalg.blas as blas

a = np.random.randint(2, size=(n, n)).astype(np.int8)
b = blas.sgemm(alpha=1.0, a=a, b=a)
```
the computation is multi-threaded. Now, `blas.sgemm` runs with exactly the same timing as `np.dot` for float32 arrays, but for non-floats it converts everything to float32 and outputs floats, which is something `np.dot` doesn't do. (In addition, `b` is now in `F_CONTIGUOUS` order, which is a lesser issue.)
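If the float32 output and the memory order were the only problems, they could be fixed after the fact; a minimal sketch (the extra copy costs time and memory, and casting back to int8 silently wraps any value outside its range):

```python
import numpy as np
import scipy.linalg.blas as blas

n = 2000
a = np.random.randint(2, size=(n, n)).astype(np.int8)

b = blas.sgemm(alpha=1.0, a=a, b=a)  # float32 result in F_CONTIGUOUS order
b = b.astype(np.int8, order='C')     # one copy: back to int8 and C order,
                                     # but values outside int8 range wrap silently
```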
So, if I want to do integer matrix multiplication, I have to do one of the following:

1. Use `np.dot` and be glad I get to keep the 8-bit integers.
2. Use SciPy's `sgemm` and use up 4x memory.
3. Use `np.float16` and only use up 2x memory, with the caveat that `np.dot` is much slower on float16 arrays than on float32 arrays, more so than int8.
4. Find a library that does multi-threaded integer matrix multiplication.

Can I follow option 4? Does such a library exist?
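As a quick check of the memory multipliers behind options 2 and 3 (a sketch; sizes shown for the 10000×10000 case from the table above):

```python
import numpy as np

n = 10000
a = np.random.randint(2, size=(n, n)).astype(np.int8)

print(a.nbytes // 2**20, 'MiB as int8')                        # ~95 MiB
print(a.astype(np.float16).nbytes // 2**20, 'MiB as float16')  # 2x
print(a.astype(np.float32).nbytes // 2**20, 'MiB as float32')  # 4x
```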
Disclaimer: I'm actually running NumPy + MKL, but I've tried a similar test on vanilla NumPy, with similar results.
Note that as this answer ages, NumPy might gain optimized integer support. Please verify whether this answer is still faster on your setup.
One workaround is to parallelize the multiplication yourself, building on `numpy.dot`, which releases the global interpreter lock. Thus, it is possible to use threads, which are relatively lightweight and can access the arrays from the main thread for memory efficiency.

Implementation:
```python
import numpy as np
from numpy.testing import assert_array_equal
import threading
from time import time


def blockshaped(arr, nrows, ncols):
    """
    Return an array of shape (nrows, ncols, n, m) where
    n * nrows, m * ncols = arr.shape.
    This should be a view of the original array.
    """
    h, w = arr.shape
    n, m = h // nrows, w // ncols
    return arr.reshape(nrows, n, ncols, m).swapaxes(1, 2)


def do_dot(a, b, out):
    #np.dot(a, b, out)  # does not work. maybe because out is not C-contiguous?
    out[:] = np.dot(a, b)  # less efficient because the output is stored in a temporary array?


def pardot(a, b, nblocks, mblocks, dot_func=do_dot):
    """
    Return the matrix product a * b.
    The product is split into nblocks * mblocks partitions that are
    performed in parallel threads.
    """
    n_jobs = nblocks * mblocks
    print('running {} jobs in parallel'.format(n_jobs))

    out = np.empty((a.shape[0], b.shape[1]), dtype=a.dtype)

    out_blocks = blockshaped(out, nblocks, mblocks)
    a_blocks = blockshaped(a, nblocks, 1)
    b_blocks = blockshaped(b, 1, mblocks)

    threads = []
    for i in range(nblocks):
        for j in range(mblocks):
            th = threading.Thread(target=dot_func,
                                  args=(a_blocks[i, 0, :, :],
                                        b_blocks[0, j, :, :],
                                        out_blocks[i, j, :, :]))
            th.start()
            threads.append(th)

    for th in threads:
        th.join()

    return out


if __name__ == '__main__':
    a = np.ones((4, 3), dtype=int)
    b = np.arange(18, dtype=int).reshape(3, 6)
    assert_array_equal(pardot(a, b, 2, 2), np.dot(a, b))

    a = np.random.randn(1500, 1500).astype(int)

    start = time()
    pardot(a, a, 2, 4)
    time_par = time() - start
    print('pardot: {:.2f} seconds taken'.format(time_par))

    start = time()
    np.dot(a, a)
    time_dot = time() - start
    print('np.dot: {:.2f} seconds taken'.format(time_dot))
```
With this implementation I get a speedup of approximately 4x, which is the number of physical cores in my machine:
```
running 8 jobs in parallel
pardot: 5.45 seconds taken
np.dot: 22.30 seconds taken
```
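Applied to the question's int8 case, the same `pardot` works unchanged (a sketch, continuing from the script above; note that `blockshaped` requires the matrix dimensions to divide evenly by the block counts, and that int8 accumulation wraps around exactly as it does with plain `np.dot`):

```python
n = 2048  # must be divisible by both block counts below
a = np.random.randint(2, size=(n, n)).astype(np.int8)

# 2 * 4 = 8 threads, each running a GIL-releasing np.dot on one block pair.
c = pardot(a, a, 2, 4)
```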
"Why is it faster to perform float by float matrix multiplication compared to int by int?" explains why integers are so slow: First, the CPUs have high-throughput floating point pipelines. Second, BLAS has no integer-type.
Workaround: Converting the matrices to `float32` values gets big speedups. How's a 90x speedup on a 2015 MacBook Pro? (Using `float64` is half as good.)
```python
import numpy as np
import time


def timeit(callable):
    start = time.time()
    callable()
    end = time.time()
    return end - start


# np.random.random_integers(0, 9, ...) is deprecated;
# np.random.randint(0, 10, ...) is the equivalent call.
a = np.random.randint(0, 10, size=(1000, 1000)).astype(np.int8)

timeit(lambda: a.dot(a))  # ≈0.9 sec
timeit(lambda: a.astype(np.float32).dot(a.astype(np.float32)).astype(np.int8))  # ≈0.01 sec
```
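One caveat worth spelling out: casting the product back to `np.int8` wraps around for any entry outside [-128, 127], and here the entries reach up to 9 · 9 · 1000 = 81000. Since float32 represents integers exactly up to 2^24, the fast path is still exact if you cast back to a wide integer type instead; a quick check:

```python
import numpy as np

a = np.random.randint(0, 10, size=(1000, 1000)).astype(np.int8)

exact = a.astype(np.int64).dot(a.astype(np.int64))     # reference, no overflow
fast = a.astype(np.float32).dot(a.astype(np.float32))  # the fast path

# All entries stay below 2**24, so the float32 sums are exact.
assert np.array_equal(fast.astype(np.int64), exact)
```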