Why does a numpy array with dtype=object result in a much smaller file size than dtype=int?

Tags:

python

numpy

Here is an example:

import numpy as np
randoms = np.random.randint(0, 20, 10000000)

# np.int and np.object were aliases for the builtins; they were removed in NumPy 1.24
a = randoms.astype(int)
b = randoms.astype(object)

np.save('d:/dtype=int.npy', a)     # 39 MB
np.save('d:/dtype=object.npy', b)  # 19 MB!

You can see that the file with dtype=object is about half the size. How come? I was under the impression that properly defined numpy dtypes are strictly better than object dtypes.

994

asked Jan 04 '17 21:01

Muppet

1 Answer

With a non-object dtype, most of the npy file format consists of a dump of the raw bytes of the array's data. That'd be either 4 or 8 bytes per element here, depending on whether your NumPy defaults to 4- or 8-byte integers. From the file size, it looks like 4 bytes per element.
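As a rough check of the raw-dump claim, the sketch below saves a smaller array to an in-memory buffer and compares the file size with the element count times the item size. The helper name npy_size and the array length are assumptions made for illustration, and the dtypes are pinned explicitly so the result does not depend on the platform default.

```python
import io

import numpy as np


def npy_size(arr):
    """Return the number of bytes np.save writes for arr."""
    buf = io.BytesIO()
    np.save(buf, arr)
    return buf.getbuffer().nbytes


n = 100_000  # much smaller than the question's 10,000,000, to keep this quick
values = np.arange(n) % 20  # same value range as the question: 0..19

size32 = npy_size(values.astype(np.int32))
size64 = npy_size(values.astype(np.int64))

# Whatever is left after subtracting the raw data is the npy header.
header32 = size32 - 4 * n
header64 = size64 - 8 * n

print("int32:", size32, "bytes, header:", header32)
print("int64:", size64, "bytes, header:", header64)
```

Scaled up to 10,000,000 elements, 4 bytes each comes to about 40 million bytes, which is consistent with the roughly 39 MB file in the question.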

With an object dtype, most of the npy file format consists of an ordinary pickle of the array. For small integers, such as those in your array, the pickle uses the K pickle opcode, long name BININT1, "documented" in the pickletools module:

I(name='BININT1',
  code='K',
  arg=uint1,
  stack_before=[],
  stack_after=[pyint],
  proto=1,
  doc="""Push a one-byte unsigned integer.

  This is a space optimization for pickling very small non-negative ints,
  in range(256).
  """),

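This requires two bytes per integer, one for the K opcode and one byte of unsigned integer data. For 10,000,000 integers that is about 20,000,000 bytes (roughly 19 MiB), which matches the size of the object-dtype file.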
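The two-bytes-per-integer figure can be checked with the standard library alone. This is a minimal sketch that uses a plain Python list as a stand-in for the contents of the object array; the choice of protocol 2 is an assumption, since BININT1 is available in any protocol from 1 upward.

```python
import pickle
import pickletools

# Stand-in for the object array's contents: 1000 small non-negative ints.
values = [i % 20 for i in range(1000)]

# Protocol 2 is an assumption here; BININT1 exists in every protocol >= 1.
payload = pickle.dumps(values, protocol=2)

# Count how many opcodes of each kind appear in the pickle stream.
opcode_counts = {}
for opcode, arg, pos in pickletools.genops(payload):
    opcode_counts[opcode.name] = opcode_counts.get(opcode.name, 0) + 1

binint1_count = opcode_counts.get("BININT1", 0)
bytes_per_int = len(payload) / len(values)

print("BININT1 opcodes:", binint1_count)
print("pickle size:", len(payload), "bytes")
print("bytes per integer:", round(bytes_per_int, 3))
```

Each integer shows up as one BININT1 opcode, and the handful of bytes left over are the framing opcodes around the list.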

Note that you could have cut down the file size even further by storing your array with dtype numpy.int8 or numpy.uint8, for roughly 1 byte per integer.
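A small sketch of that suggestion, assuming the values are known to fit in 0..255 (they do here, since they come from randint(0, 20)). The array length is again reduced for illustration.

```python
import io

import numpy as np

n = 100_000
values = np.arange(n) % 20  # values in 0..19, as in the question

# Guard against silent wrap-around before narrowing the dtype.
assert values.min() >= 0 and values.max() <= np.iinfo(np.uint8).max

small = values.astype(np.uint8)

buf = io.BytesIO()
np.save(buf, small)
size8 = buf.getbuffer().nbytes

# Round-trip to confirm nothing was lost by the narrower dtype.
buf.seek(0)
restored = np.load(buf)

print("uint8 file size:", size8, "bytes for", n, "elements")
print("round trip equal:", np.array_equal(restored, values))
```

The file size comes out at roughly one byte per element plus the npy header.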

118

answered Sep 21 '22 03:09

user2357112 supports Monica
