Reading the interesting book "From Python to Numpy", I came across an example described as follows:

> Let's consider a simple example where we want to clear all the values from an array which has the dtype `np.float32`. How does one write it to maximize speed?
The provided results surprised me, and when I rechecked them I got completely different behavior. So I asked the author to double-check, but he got the same results as before (on OSX 10.13.3; see the table below).

The variants were timed on three different computers: mine (Windows 10, Windows 7) and the author's (OSX 10.13.3), with Python 3.6.4 and NumPy 1.14.2. Each variant was timed for a fixed 100 loops, best of 3.

Edit: This question is not about the fact that different computers with different specs give different times - that is obvious :) The question is why the behavior differs so much between the two operating systems, which is not so obvious (if it is indeed so; I would be glad if someone could double-check).
The setup was: `Z = np.ones(4*1000000, np.float32)`
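For anyone who wants to reproduce the measurements, here is a minimal sketch using Python's `timeit` module. It covers only a few of the variants, and the "100 loops, best of 3" methodology matches the one described above; exact numbers will of course differ per machine:

```python
import timeit

import numpy as np

Z = np.ones(4 * 1000000, np.float32)

# A few of the benchmarked variants; each one reinterprets the same
# underlying buffer through a different dtype before zeroing it.
variants = [
    "Z.view(np.float64)[...] = 0",
    "Z.view(np.int8)[...] = 0",
    "Z.fill(0)",
    "Z[...] = 0",
]

for stmt in variants:
    # 100 loops, best of 3
    best = min(timeit.repeat(stmt, globals={"np": np, "Z": Z},
                             repeat=3, number=100))
    print(f"{stmt:30s} {best / 100 * 1e6:10.1f} usec per loop")
```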
| Variant | Windows 10 (computer 1) | Ubuntu 17.10 (computer 1) | Windows 7 (computer 2) | OSX 10.13.3 (computer 3) |
| --------------------------- | --------- | --------- | --------- | --------- |
| `Z.view(np.float64)[...] = 0` | 758 usec | 1.03 msec | 2.72 msec | 1.01 msec |
| `Z.view(np.float32)[...] = 0` | 757 usec | 1.01 msec | 2.61 msec | 1.58 msec |
| `Z.view(np.float16)[...] = 0` | 760 usec | 1.01 msec | 2.62 msec | 2.85 msec |
| `Z.view(np.complex)[...] = 0` | 1.06 msec | 1.02 msec | 3.26 msec | 918 usec |
| `Z.view(np.int64)[...] = 0` | 758 usec | 1.03 msec | 2.69 msec | 1 msec |
| `Z.view(np.int32)[...] = 0` | 757 usec | 1.01 msec | 2.62 msec | 1.46 msec |
| `Z.view(np.int16)[...] = 0` | 760 usec | 1.01 msec | 2.63 msec | 2.87 msec |
| `Z.view(np.int8)[...] = 0` | 758 usec | 773 usec | 2.68 msec | 614 usec |
| `Z.fill(0)` | 747 usec | 998 usec | 2.55 msec | N/A |
| `Z[...] = 0` | 750 usec | 1 msec | 2.59 msec | N/A |
As you can see from this table, on Windows the results don't depend on the viewed type, but on OSX this trick strongly affects performance. Can you provide insight into why this happens?
Edit: As I wrote above, the three computers are different.

Specs of the first computer (Windows 10 and Ubuntu 17.10):

- CPU: Intel Xeon E5-1650 v4, 3.60 GHz
- RAM: 128 GB DDR4-2400

Specs of the second computer (Windows 7):

- CPU: Intel Pentium P6100, 2.00 GHz
- RAM: 4 GB DDR3-1333

Specs of the third computer (OSX 10.13.3): I don't have this information :)
Link to the issue
Edit 2: Added results for the first computer on Ubuntu 17.10.
Keep in mind that Python is a very high-level programming language, and Pandas is likewise a high-level framework.
What you're essentially given to work with is a high level API for many operations that you can perform with the language, without the need to worry about the underlying implementation.
If you were working with a lower-level API, then to assign an array to a variable you would have to allocate some memory, create a structure to hold your data, and link it together (probably using pointers to memory addresses). And that still doesn't touch the actual chip: there is virtual-memory mapping happening between your API and the data actually stored on the chip. That complexity applies to basically everything you do with Python & Pandas.

Yet all you have to write is `arr = [1, 2, 3]`, without worrying about any of it.
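As a rough illustration of the contrast, `ctypes` lets you do the C-style version of this from Python: allocate a fixed-size buffer and fill it by hand. The names and sizes here are purely illustrative:

```python
import ctypes

# Low-level style: allocate a fixed block of memory for three machine
# integers, then write each value into it explicitly.
n = 3
buf = (ctypes.c_long * n)()   # zero-initialized C array of 3 longs
for i, value in enumerate([1, 2, 3]):
    buf[i] = value

# High-level style: one line, allocation and layout handled for you.
arr = [1, 2, 3]

print(list(buf), arr)
```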
Now, Python is expected to work the same on every platform you run it on - at least in most cases.
Now that the boring introduction is behind us: the whole idea of "expose a uniform API, don't worry about the implementation" is widespread in computer programming. There are subtle implementation details that differ from one operating system to another, which may or may not impact the performance of your software. I don't expect that to be significant, but it's still there and worth mentioning.
For example, there is an old answer about the `np.dot` function performing differently between Linux and Windows. Its author has far more knowledge of this subject than I do, and points out that this particular function is a wrapper around CBLAS routines, which use the fastest routines available on a given platform.
That being said, Pandas is a very complex library that aims to make data analysis as simple as possible by exposing a simple-to-use API to the programmer. I expect there are many more places where Pandas does a great job of using the best mechanisms available on your platform to perform its tasks as fast as it can.