So if I have something like <code>x=np.random.rand(60000)*400-200</code>. iPython's <code>%timeit</code> says: <ul> <li> <code>x.astype(int)</code> takes 0.14ms</li> <li> <code>np.rint(x)</code> and <code>np.around(x)</code> take 1.01ms</li> </ul> Note that in the <code>rint</code> and <code>around</code> cases you still need to spend the extra 0.14ms to do a final <code>astype(int)</code> (assuming that's what you ultimately want). Question: am I right in thinking that most modern hardware is capable of doing both operations in equal time. If so, why is numpy taking 8 times longer for the rounding? As it happens I'm not super fussy about the exactness of the arithmetic, but I can't see how to take advantage of that with numpy (I'm doing messy biology not particle physics).

<code>np.around(x).astype(int)</code> and <code>x.astype(int)</code> don't produce the same values. The former rounds even (it's the same as <code>((x*x>=0+0.5) + (x*x<0-0.5)).astype(int)</code>) whereas the latter rounds towards zero. However, <pre class="prettyprint"><code>y = np.trunc(x).astype(int) z = x.astype(int) </code></pre> shows <code>y==z</code> but calculating <code>y</code> is much slower. So it's the <code>np.trunc</code>and <code>np.around</code> functions which are slow. <pre class="prettyprint"><code>In [165]: x.dtype Out[165]: dtype('float64') In [168]: y.dtype Out[168]: dtype('int64') </code></pre> So <code>np.trunc(x)</code> rounds towards zero from double to double. Then <code>astype(int)</code> has to convert double to int64. Internally I don't know what python or numpy are doing but I know how I would do this in C. Let's discuss some hardware. With SSE4.1 it's possible to do round, floor, ceil, and trunc from double to double using: <pre class="prettyprint"><code>_mm_round_pd(a, 0); //round: round even _mm_round_pd(a, 1); //floor: round towards minus infinity _mm_round_pd(a, 2); //ceil: round towards positive infinity _mm_round_pd(a, 3); //trunc: round towards zero </code></pre> but numpy needs to support systems without SSE4.1 as well so it would have to build without SSE4.1 as well as with SSE4.1 and then use a dispatcher. But to do this from double directly to int64 using SSE/AVX is not efficient until AVX512. However, it is possible to round double to int32 efficiently using only SSE2: <pre class="prettyprint"><code>_mm_cvtpd_epi32(a); //round double to int32 then expand to int64 _mm_cvttpd_epi32(a); //trunc double to int32 then expand to int64 </code></pre> These converts two doubles to two int64. In your case this would work fine since the range is certainly within int32. But unless python knows the range fits in int32 it can't assume this so it would have to round or trunc to int64 which is slow. Also, once again numpy would have to build to support SSE2 to do this anyway. But maybe you could have used a single floating point array to begin with. In that case you could have done: <pre class="prettyprint"><code>_mm_cvtps_epi32(a); //round single to int32 _mm_cvttps_epi32(a) //trunc single to int32 </code></pre> These convert four singles to four int32. So to answer your question SSE2 can round or truncated from double to int32 efficiently. AVX512 will be able to round or truncated from double to int64 efficiently as well using <code>_mm512_cvtpd_epi64(a)</code> or <code>_mm512_cvttpd_epi64(a)</code>. SSE4.1 can round/trunc/floor/ceil from float to float or double to double efficiently.

numpy around/rint slow compared to astype(int)

2 Answers

np.around(x).astype(int) and x.astype(int) don't produce the same values. The former rounds even (it's the same as ((x*x>=0+0.5) + (x*x<0-0.5)).astype(int)) whereas the latter rounds towards zero. However,

y = np.trunc(x).astype(int)
z = x.astype(int)

shows y==z but calculating y is much slower. So it's the np.truncand np.around functions which are slow.

In [165]: x.dtype
Out[165]: dtype('float64')
In [168]: y.dtype
Out[168]: dtype('int64')

So np.trunc(x) rounds towards zero from double to double. Then astype(int) has to convert double to int64.

Internally I don't know what python or numpy are doing but I know how I would do this in C. Let's discuss some hardware. With SSE4.1 it's possible to do round, floor, ceil, and trunc from double to double using:

_mm_round_pd(a, 0); //round: round even
_mm_round_pd(a, 1); //floor: round towards minus infinity
_mm_round_pd(a, 2); //ceil:  round towards positive infinity
_mm_round_pd(a, 3); //trunc: round towards zero

but numpy needs to support systems without SSE4.1 as well so it would have to build without SSE4.1 as well as with SSE4.1 and then use a dispatcher.

But to do this from double directly to int64 using SSE/AVX is not efficient until AVX512. However, it is possible to round double to int32 efficiently using only SSE2:

_mm_cvtpd_epi32(a);  //round double to int32 then expand to int64
_mm_cvttpd_epi32(a); //trunc double to int32 then expand to int64

These converts two doubles to two int64.

In your case this would work fine since the range is certainly within int32. But unless python knows the range fits in int32 it can't assume this so it would have to round or trunc to int64 which is slow. Also, once again numpy would have to build to support SSE2 to do this anyway.

But maybe you could have used a single floating point array to begin with. In that case you could have done:

_mm_cvtps_epi32(a); //round single to int32
_mm_cvttps_epi32(a) //trunc single to int32

These convert four singles to four int32.

So to answer your question SSE2 can round or truncated from double to int32 efficiently. AVX512 will be able to round or truncated from double to int64 efficiently as well using _mm512_cvtpd_epi64(a) or _mm512_cvttpd_epi64(a). SSE4.1 can round/trunc/floor/ceil from float to float or double to double efficiently.

107

answered Oct 18 '22 04:10

Z boson

As pointed out by @jme in the comments, the rint and around functions must work out whether to round the fractions up or down to the nearest integer. In contrast, the astype function will always round down so it can immediately discard the decimal information. There are a number of other functions that do the same thing. Also, you could improve the speed by using a lower number of bits for the integer. However you must be careful that you can accommodate the full range of your input data.

%%timeit
np.int8(x)
10000 loops, best of 3: 165 µs per loop

Note, this does not store values outside the range -128 to 127 as it's 8-bit. Some values in your example fall outside this range.

Of all the others I tried, np.intc seems to be the fastest:

%%timeit
np.int16(x)
10000 loops, best of 3: 186 µs per loop

%%timeit
np.intc(x)
10000 loops, best of 3: 169 µs per loop

%%timeit
np.int0(x)
10000 loops, best of 3: 170 µs per loop

%%timeit
np.int_(x)
10000 loops, best of 3: 188 µs per loop

%%timeit
np.int32(x)
10000 loops, best of 3: 187 µs per loop

%%timeit
    np.trunc(x)
1000 loops, best of 3: 940 µs per loop

Your examples, on my machine:

%%timeit
np.around(x)
1000 loops, best of 3: 1.48 ms per loop

%%timeit
np.rint(x)
1000 loops, best of 3: 1.49 ms per loop

%%timeit
x.astype(int)
10000 loops, best of 3: 188 µs per loop

answered Oct 18 '22 04:10

atomh33ls

Related questions
                            
                                Flask test_client: Testing DELETE request with data
                            
                                Catching the exception thrown by python script in shell script
                            
                                Flask: Creating objects that remain over multiple requests
                            
                                Defining a gradient with respect to a subtensor in Theano
                            
                                How do I use colorbar with hist2d in matplotlib.pyplot?
                            
                                PyCharm - Have author appear before imports?
                            
                                ImportError: cannot import name murmurhash3_32
                            
                                Error "__init__ method from base class is not called" for an abstract class
                            
                                How can I find all Markdown links using regular expressions?
                            
                                Are Python Empty Immutables Singletons?
                            
                                How to parse winmail.dat with python
                            
                                create a database using pyodbc
                            
                                Plotting time series using Seaborn FacetGrid
                            
                                python-social-auth backend not found for linkedin oauth2
                            
                                parallel generation of random forests using scikit-learn
                            
                                What is the difference between 3/2 and -3/2?
                            
                                How to comment out multiple lines in a Linux config file?
                            
                                Handling SQL constraint exceptions in Odoo ORM
                            
                                Elastic Beanstalk unable to install packages
                            
                                json.dumps(): escaping forward slashes

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

numpy around/rint slow compared to astype(int)

Tags:

python

c

assembly

numpy

sse

dan-man

People also ask

2 Answers

Z boson

atomh33ls

Recent Activity

Donate For Us