
np.array arr.itemsize vs sys.getsizeof(arr[0])

Given an array

arr = array([  9.93418544e+00,   1.17237323e+01,   1.34554537e+01,
         2.43598467e+01,   2.72818286e+01,   3.11868750e+01,...])

When I execute the following commands, I get this output:

import sys
import numpy as np

arr.itemsize # 8
type(arr[0]) # numpy.float64
sys.getsizeof(np.float64()) # 32
sys.getsizeof(arr[0]) # 32
arr.dtype # dtype('float64')

It seems that itemsize doesn't work properly. Why does this happen?

I work with

print(sys.version)
3.5.5 | packaged by conda-forge | (default, Jul 24 2018, 01:52:17) [MSC v.1900 64 bit (AMD64)]
numpy==1.10.4
user3102962 asked Sep 16 '25 03:09



1 Answer

It seems that itemsize doesn't work properly.

It does. The results differ because a Python object is not the same thing as an item stored inside a numpy array.

In Python, everything is an object, so the data is "boxed". For an int, for example, we get:

>>> sys.getsizeof(2)
28

That is 28 bytes (on a typical 64-bit CPython 3 build), which is a lot. In most programming languages an int takes between two and eight bytes; a 32-bit int takes 4 bytes.

But in Python, an object carries a lot of "context" around its value. For example, some bytes hold a reference to the object's type, and others hold its reference count.
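To make the overhead concrete, here is a minimal sketch that compares the size of a boxed Python int with the size of a bare 32-bit value. The variable names are only illustrative, and the exact number reported by sys.getsizeof depends on the Python build, so no fixed value is assumed for it.

```python
import sys
import numpy as np

# Size of the complete Python int object (value plus per-object bookkeeping).
boxed_int_size = sys.getsizeof(2)

# Size of a bare 32-bit integer, as numpy would store it: 4 bytes.
raw_int32_size = np.dtype(np.int32).itemsize

print("Python int object:", boxed_int_size, "bytes")
print("raw 32-bit integer:", raw_int32_size, "bytes")
print("per-object overhead:", boxed_int_size - raw_int32_size, "bytes")
```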

Numpy, however, is not implemented in pure Python and does not store its data as Python objects. It is mostly implemented in C, with a nice interface to Python. This means that an array built from the list [1, 4, 2, 5] is not stored as a list with four references to int objects, but as a contiguous block of "unboxed" elements. So, given that the ints take 32 bits each, the data takes 4*32 bits, plus some extra space for the "context" around the array.

Items are thus stored in a more space-efficient way. This also makes processing the values faster, since we do not have to follow pointers but can read the values directly (there are ways to store references in a numpy array, but let us ignore that for now). Furthermore, a numpy array takes far less memory than the equivalent Python list (together with the items it holds) would.
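A rough way to see the difference is to compare the array's data buffer with the list plus the int objects it references. This is only a sketch: the dtype is set to int32 explicitly to match the 32-bit assumption above, and the list figure is an estimate, since CPython shares small int objects between references.

```python
import sys
import numpy as np

values = [1, 4, 2, 5]
arr = np.array(values, dtype=np.int32)

# Raw data buffer of the array: 4 elements * 4 bytes each.
array_data_bytes = arr.nbytes

# The list holds one pointer per element, and each element is its own int object.
list_total_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

print(arr.itemsize, array_data_bytes)   # 4 16
print(list_total_bytes)                 # platform dependent, but well above 16
```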

If you fetch an item from a numpy array, however, numpy needs to make a Python object for it. So here it constructs a numpy.float64 object that contains the value, but again with a lot of "context" around that value. That is why sys.getsizeof(arr[0]) reports 32 bytes while arr.itemsize reports the 8 bytes the value occupies inside the array.
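The following sketch reproduces the situation from the question with a short array (the three values are taken from the question; the array is truncated for illustration). The size of the boxed scalar is printed rather than assumed, since it can differ between builds.

```python
import sys
import numpy as np

arr = np.array([9.93418544, 11.7237323, 13.4554537], dtype=np.float64)

item = arr[0]                 # indexing builds a new numpy.float64 object
print(type(item))             # <class 'numpy.float64'>
print(arr.itemsize)           # 8: bytes per element inside the array buffer
print(arr.nbytes)             # 24: 3 elements * 8 bytes
print(sys.getsizeof(item))    # larger than 8: the value plus object overhead

# Each lookup creates a separate scalar object around the same stored value.
first, second = arr[0], arr[0]
print(first is second)        # False
```

So itemsize describes the storage inside the array, while sys.getsizeof describes the temporary object that indexing hands back.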

The fact that numpy constructs an array with one fixed element type also has some consequences. For example, if you use numpy.int16, you cannot store values larger than 32767, since such values cannot be represented in a 16-bit two's complement representation:

>>> np.int16(32767)
32767
>>> np.int16(32768)
-32768
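The scalar constructor shown above behaves that way on the numpy version from the question; newer numpy releases may reject an out-of-range Python int there instead of wrapping. As a sketch that should not depend on the constructor, the same wraparound can be seen with array arithmetic:

```python
import numpy as np

a = np.array([32767], dtype=np.int16)   # largest value an int16 can hold
wrapped = a + np.int16(1)               # array arithmetic wraps around on overflow

print(np.iinfo(np.int16).min, np.iinfo(np.int16).max)   # -32768 32767
print(wrapped, wrapped.dtype)                            # [-32768] int16
```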

Furthermore, one cannot construct an array that contains objects of different types without using Python object references or some other "tricks". If numpy constructs an array of ten int16 values, for example, it interprets the 160 bits as ten 16-bit numbers. A Python list, by contrast, contains references to objects, and each Python object knows what type it is, so we can point a reference at another object of another type.
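A short sketch of both points: the same 160 bits can be read under a different dtype, and mixed types only fit in an array when it stores object references. The names mixed_list, promoted and boxed are just illustrative.

```python
import numpy as np

a = np.arange(10, dtype=np.int16)
print(a.nbytes * 8)              # 160 bits in the buffer
print(a.view(np.int32).shape)    # (5,): the same bytes read as five 32-bit numbers

mixed_list = [1, "two", 3.0]     # a list can reference objects of any type
promoted = np.array([1, 2.5])    # numpy picks one common dtype for all elements
print(promoted.dtype)            # float64

# With dtype=object the array stores references, much like a list does.
boxed = np.array(mixed_list, dtype=object)
print(boxed.dtype, [type(x).__name__ for x in boxed])   # object ['int', 'str', 'float']
```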

Willem Van Onsem answered Sep 17 '25 17:09
