 

What are the benefits / drawbacks of a list of lists compared to a numpy array of OBJECTS with regards to MEMORY?

I'm trying to understand the memory and other overhead implications of using NumPy arrays of dtype object compared to lists of lists.

Does this change with dimensionality, e.g. 2D vs. 3D vs. N-D?

Some of the benefits I can think of when using NumPy arrays are things like .shape and .T, and that you can cast them to matrices with np.matrix much faster.

Is there anything else?

Also, if anyone is interested, the object I'm using is:

    import gmpy2 as gm
    gm.mpfr('0')   # <-- this is the object
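
For concreteness, here is a minimal sketch of the two containers side by side, assuming gmpy2 and numpy are installed; the shape and values are just illustrative:

    import gmpy2 as gm
    import numpy as np

    rows, cols = 3, 4

    # List of lists: every cell is a pointer to an mpfr object.
    lol = [[gm.mpfr('0') for _ in range(cols)] for _ in range(rows)]

    # dtype=object array: likewise a grid of pointers, plus array conveniences.
    arr = np.array(lol, dtype=object)

    print(arr.shape)        # (3, 4)
    print(arr.T.shape)      # (4, 3)
    print(type(arr[0, 0]))  # the very same mpfr objects either way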

EDIT:

Just to clarify, I'm interested in the case where the NumPy array's dtype is object, not a native NumPy type.

EDIT 2:

Relevant follow-up with regard to speed:

What are the benefits / drawbacks of a list of lists compared to a numpy array of OBJECTS with regards to SPEED?

asked Nov 05 '14 by evan54


1 Answer

I'm going to answer your primary question, and leave the others (performance of transpose, etc.) out. So:

I'm trying to understand the memory and other overhead implications of using NumPy arrays … Just to clarify, I'm interested in the case where the NumPy array's dtype is object, not a float, double, or int

A Python list is an array of pointers to the Python objects wrapping whatever actual values you've stored in it, plus some extra slack so it can be expanded on the fly efficiently. Let's call that slack 20%, just for the sake of easy computation. For example, a list of 10000 32-bit integers takes up, say, 96000 bytes for the pointer array, plus around 240000 bytes for the Python integer objects, plus a small overhead for the list itself, say 80 bytes.
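
You can ballpark these numbers yourself with sys.getsizeof. A sketch assuming a 64-bit CPython 3, where a small int is actually 28 bytes rather than the 24 assumed above; exact figures vary by version and platform:

    import sys

    n = 10000
    lst = []
    for i in range(n):
        lst.append(i)  # grown incrementally, so the pointer array carries slack

    container = sys.getsizeof(lst)                # list header + pointers + slack
    wrapped = sum(sys.getsizeof(x) for x in lst)  # the wrapped int objects
    print(container)  # roughly 80-90 KB
    print(wrapped)    # about 280000 bytes, at 28 bytes per small int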

A NumPy array is an array of whatever actual values you've stored in it. For example, an array of 10000 32-bit integers takes up 40000 bytes, plus a small overhead for the array itself, say 80 bytes. But when you're using dtype object, each "actual value" is just a pointer to a Python object, just as with a list.

So, the only real difference here is the slack: The array is going to use 320080 bytes, while the list is going to use 336080 bytes. Not a huge difference, but it can matter.
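
A quick way to see the container-only difference (the wrapped objects are shared either way); a sketch on a typical 64-bit CPython 3, where the exact numbers vary:

    import sys
    import numpy as np

    objs = [str(i) for i in range(10000)]  # any Python objects will do

    lst = list(objs)                    # built at a known size, so little slack here
    arr = np.array(objs, dtype=object)  # a pointer array behind a numpy header

    print(sys.getsizeof(lst))  # ~80056: 56-byte list header + 10000 pointers
    print(arr.nbytes)          # 80000: just the 10000 pointers
    print(sys.getsizeof(arr))  # nbytes plus a header of roughly 100 bytes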


Also, does one become faster than the other in 2D vs. N-D, or with the size along a given dimension?

Yes, the nested list will increase faster… but not by a huge amount.

A multidimensional array in numpy is stored as one giant array (in either C or Fortran striding order), so whether the shape is (10000,), (100, 100), or (10, 10, 10, 10), it's the same size. (The overhead may get a few bytes bigger to store more information about the striding, but if we're talking about, say, 256 bytes vs. 80 out of 320K, who cares?)
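
That's easy to confirm: reshaping just reinterprets the same flat buffer, so nbytes never changes. A sketch:

    import numpy as np

    flat = np.empty(10000, dtype=object)  # one flat buffer of 10000 pointers
    print(flat.nbytes)  # 80000 on a 64-bit build

    for shape in [(10000,), (100, 100), (10, 10, 10, 10)]:
        a = flat.reshape(shape)  # a view: same buffer, different strides
        print(shape, a.nbytes)   # the same 80000 every time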

A nested list, on the other hand, has more lists, with slack and overhead at every level. For example, a list of 10 lists of 10 lists of 10 lists of integers has 1+10+100+1000 arrays of 12 pointers, and 1+10+100+1000 list headers.

So, the array is still using 320080 bytes, or maybe 320256, but the list is using 435536.
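
If you want to check the nested-list tally, a rough recursive walk over the containers does it (leaf objects are counted separately). A sketch; note that comprehension-built lists carry append slack while the innermost [0] * 10 lists are allocated exactly, so the absolute numbers will differ a bit from the 20% model above:

    import sys

    def container_bytes(obj):
        # Sum the sizes of the list objects themselves, not the leaf values.
        if isinstance(obj, list):
            return sys.getsizeof(obj) + sum(container_bytes(x) for x in obj)
        return 0

    nested = [[[[0] * 10 for _ in range(10)]
               for _ in range(10)]
              for _ in range(10)]
    print(container_bytes(nested))  # 1111 list headers plus 1111 pointer arrays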


If you want to know more about how list is implemented… well, that depends on the implementation you're using. But in CPython, the C API pretty much guarantees that it's going to be storing a contiguous array of PyObject *, and the fact that appending is amortized constant time pretty much requires that it leave slack that grows proportionately. And you can see in the headers and the source that this is exactly what it does. (Also, keep in mind that the specific sizes you get from that source will generally depend on the platform you compile it on. Most importantly, because there are pointers all over the place, 64-bit platforms tend to have somewhere between 50% and 100% more overhead per object than 32-bit platforms.)
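
You can watch that proportional over-allocation directly: sys.getsizeof jumps in steps as a list grows by appending, and the steps get wider as the list gets longer. A small sketch; the exact step pattern varies by CPython version:

    import sys

    lst = []
    last = sys.getsizeof(lst)
    print(f"len=  0  size={last} bytes")
    for _ in range(64):
        lst.append(None)
        size = sys.getsizeof(lst)
        if size != last:  # the pointer array was just reallocated with more slack
            print(f"len={len(lst):3d}  size={size} bytes")
            last = size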

answered by abarnert