
Why does a numpy array with dtype=object result in a much smaller file size than dtype=int?

Tags: python, numpy

Here is an example:

import numpy as np
randoms = np.random.randint(0, 20, 10000000)

# np.int and np.object were aliases for the builtins; they were removed in NumPy 1.24
a = randoms.astype(int)
b = randoms.astype(object)

np.save('d:/dtype=int.npy', a)     # 39 MB
np.save('d:/dtype=object.npy', b)  # 19 MB!

You can see that the file with dtype=object is about half the size. How come? I was under the impression that properly defined numpy dtypes are strictly better than object dtypes.

Muppet asked Jan 04 '17 21:01




1 Answer

With a non-object dtype, most of the npy file format consists of a dump of the raw bytes of the array's data. That'd be either 4 or 8 bytes per element here, depending on whether your NumPy defaults to 4- or 8-byte integers. From the file size, it looks like 4 bytes per element.
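As a rough check of that arithmetic, here is a minimal sketch that saves a smaller array to an in-memory buffer instead of a file. The helper name npy_size and the array length of 100,000 are my own choices for illustration, not part of the question. The saved size should come out as the raw data (itemsize times element count) plus a small header, so 10,000,000 four-byte integers would be about 40 million bytes, in line with the roughly 39 MB reported.

```python
import io

import numpy as np


def npy_size(arr):
    """Return the number of bytes np.save writes for arr (hypothetical helper)."""
    buf = io.BytesIO()
    np.save(buf, arr)
    return len(buf.getvalue())


values = np.random.randint(0, 20, 100_000)

for dtype in (np.int32, np.int64):
    arr = values.astype(dtype)
    total = npy_size(arr)
    # Whatever is not raw array data is the npy header.
    print(dtype.__name__, "itemsize:", arr.itemsize,
          "data bytes:", arr.nbytes, "header bytes:", total - arr.nbytes)
```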

With an object dtype, most of the npy file format consists of an ordinary pickle of the array. For small integers, such as those in your array, the pickle uses the K pickle opcode, long name BININT1, "documented" in the pickletools module:

I(name='BININT1',
  code='K',
  arg=uint1,
  stack_before=[],
  stack_after=[pyint],
  proto=1,
  doc="""Push a one-byte unsigned integer.

  This is a space optimization for pickling very small non-negative ints,
  in range(256).
  """),

This requires two bytes per integer, one for the K opcode and one byte of unsigned integer data.
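To see the two-byte encoding directly, the sketch below pickles a single small int and lists its opcodes with pickletools.genops. I picked protocol 2 only because it keeps the output free of framing bytes; the protocol NumPy uses internally may differ. The second half saves a small object array (100,000 elements, an arbitrary size) to an in-memory buffer and should show a little over 2 bytes per element.

```python
import io
import pickle
import pickletools

import numpy as np

# One small int: protocol header, then 'K' plus one data byte, then STOP.
payload = pickle.dumps(7, protocol=2)
opcode_names = [op.name for op, arg, pos in pickletools.genops(payload)]
print(payload)        # b'\x80\x02K\x07.'
print(opcode_names)   # ['PROTO', 'BININT1', 'STOP']

# An object array of small Python ints, saved the same way as in the question.
small = np.random.randint(0, 20, 100_000).astype(object)
buf = io.BytesIO()
np.save(buf, small, allow_pickle=True)
bytes_per_element = len(buf.getvalue()) / small.size
print(bytes_per_element)
```

This only holds for ints in range(256). Larger or negative values need longer pickle opcodes, so an object array of arbitrary integers would not stay this small.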

Note that you could have cut down the file size even further by storing your array with dtype numpy.int8 or numpy.uint8, for roughly 1 byte per integer.
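A minimal sketch of that suggestion, again using an in-memory buffer and an arbitrary array length of 100,000. Since every value in the question's array is below 20, casting to numpy.uint8 loses nothing, and the saved size should be about one byte per element plus the header.

```python
import io

import numpy as np

values = np.random.randint(0, 20, 100_000)

# All values fit in 0..255, so the cast to uint8 is lossless here.
compact = values.astype(np.uint8)
assert np.array_equal(compact, values)

buf = io.BytesIO()
np.save(buf, compact)
compact_size = len(buf.getvalue())
print("bytes per element:", compact_size / compact.size)
```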

user2357112 supports Monica answered Sep 21 '22 03:09