Consider the following piece of code, which generates some (potentially) huge, multi-dimensional array and performs numpy.tensordot with it (whether we multiply the same array or two different ones does not really matter here).
import time
import numpy

L, N = 6, 4
shape = (2 * L) * [N]  # 2L axes of length N each
A = numpy.arange(numpy.prod(shape)).reshape(shape)
A = A % 256 - 128      # small integers in [-128, +127]
axes = (range(1, 2 * L, 2), range(0, 2 * L, 2))  # contract odd axes with even axes

def run(dtype, repeat=1):
    A_ = A.astype(dtype)
    t = time.time()
    for i in range(repeat):
        numpy.tensordot(A_, A_, axes)
    t = time.time() - t
    print(dtype, ' \t%8.2f sec\t%8.2f MB' % (t, A_.nbytes / 1e6))
Now we can compare the performance for different data types, e.g.:
run(numpy.float64)
run(numpy.int64)
Since the array only consists of small integer numbers, I would like to save some memory by using dtype=int8. However, this slows down the matrix multiplication A LOT.
The first benchmark below is the important one for my use case; the others are just for reference. Timings are with NumPy 1.13.1 and Python 3.4.2.
L, N = 6, 4; A.size = 4**12 = 16777216
<class 'numpy.float64'> 59.58 sec 134.22 MB
<class 'numpy.float32'> 44.19 sec 67.11 MB
<class 'numpy.int16'> 711.16 sec 33.55 MB
<class 'numpy.int8'> 647.40 sec 16.78 MB
Same array with different data types. Memory decreases as expected. But why the large differences in CPU time? If anything, I would expect int to be faster than float.
L, N = 1, 4**6; A.size = (4**6)**2 = 16777216
<class 'numpy.float64'> 57.95 sec 134.22 MB
<class 'numpy.float32'> 42.84 sec 67.11 MB
The shape doesn't seem to have a large effect.
L, N = 5, 4; A.size = 4**10 = 1048576
<class 'numpy.float128'> 10.91 sec 16.78 MB
<class 'numpy.float64'> 0.98 sec 8.39 MB
<class 'numpy.float32'> 0.90 sec 4.19 MB
<class 'numpy.float16'> 9.80 sec 2.10 MB
<class 'numpy.int64'> 8.84 sec 8.39 MB
<class 'numpy.int32'> 5.55 sec 4.19 MB
<class 'numpy.int16'> 2.23 sec 2.10 MB
<class 'numpy.int8'> 1.82 sec 1.05 MB
Smaller timings, but the same weird trend.
L, N = 2, 4; A.size = 4**4 = 256; repeat=1000000
<class 'numpy.float128'> 17.92 sec 4.10 KB
<class 'numpy.float64'> 14.20 sec 2.05 KB
<class 'numpy.float32'> 12.21 sec 1.02 KB
<class 'numpy.float16'> 41.72 sec 0.51 KB
<class 'numpy.int64'> 14.21 sec 2.05 KB
<class 'numpy.int32'> 14.26 sec 1.02 KB
<class 'numpy.int16'> 13.88 sec 0.51 KB
<class 'numpy.int8'> 13.03 sec 0.26 KB
Other than float16 being much slower, everything is fine here.
Why is int8 so much slower for very large arrays? Is there any way around this? Saving memory becomes increasingly important for larger arrays!
Unfortunately, as correctly pointed out in the comments, the "engine" behind the scenes is BLAS, and it has no native integer type. That is why the float64 and float32 paths run faster (there is some discussion of this in a related answer to a similar question about C++).
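To make that concrete (a minimal sketch of the consequence, not code from the question): since BLAS only handles floats, you can keep the long-lived array at int8 and upcast only for the contraction itself.

import numpy

L, N = 6, 4
shape = (2 * L) * [N]
A8 = (numpy.arange(numpy.prod(shape)).reshape(shape) % 256 - 128).astype(numpy.int8)
axes = (range(1, 2 * L, 2), range(0, 2 * L, 2))

# Upcast only for the call: the temporary float32 copy costs memory during
# the contraction, but the array you keep around stays at int8 size.
A32 = A8.astype(numpy.float32)
result = numpy.tensordot(A32, A32, axes)

# Caveat: float32 represents integers exactly only up to 2**24. Here each
# output element sums 4**6 = 4096 products of magnitude up to 128*128,
# i.e. up to ~6.7e7 > 2**24, so use float64 where exact integer results matter.

This trades peak memory during the call for fast BLAS execution; whether that trade-off works depends on how tight your memory budget is at contraction time.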
As a side note to the core of your question, one avenue worth exploring to speed up the computation while limiting memory consumption is Cython, where you can run C code directly and get the result back in Python.
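For illustration, here is a minimal sketch of what such a Cython kernel might look like (file name, function name, and signature are my own assumptions, not from the question). Since numpy.tensordot internally transposes and reshapes its operands into a 2-D matrix product, an integer matrix multiply over the reshaped operands is the core operation one would hand-write; accumulating in int64 also avoids the wraparound that a pure int8 dot product would hit.

# cython: boundscheck=False, wraparound=False
# integer_dot.pyx -- hypothetical sketch; names and signature are assumptions.
import numpy
cimport numpy as cnp

def int8_matmul(cnp.int8_t[:, :] a, cnp.int8_t[:, :] b):
    """Multiply two int8 matrices, accumulating in int64 to avoid overflow."""
    cdef Py_ssize_t n = a.shape[0], m = b.shape[1], K = a.shape[1]
    cdef Py_ssize_t i, j, k
    cdef cnp.int64_t acc
    out = numpy.empty((n, m), dtype=numpy.int64)
    cdef cnp.int64_t[:, :] ov = out
    for i in range(n):
        for j in range(m):
            acc = 0
            for k in range(K):
                acc += a[i, k] * b[k, j]  # int8 * int8 promoted to C int
            ov[i, j] = acc
    return out

Note that a naive triple loop like this shows only the structure: without blocking and vectorization it will not beat NumPy's integer fallback, let alone BLAS, but it gives you a C-level starting point that keeps the data at one byte per element.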