I need to create a large numpy array containing random boolean values without hitting the swap. My laptop has 8 GB of RAM.
Creating a (1200, 2e6) array with np.ones takes less than 2 s and uses 2.29 GB of RAM:
>>> dd = np.ones((1200, int(2e6)), dtype=bool)
>>> dd.nbytes/1024./1024
2288.818359375
>>> dd.shape
(1200, 2000000)
For a relatively small (1200, 400e3), np.random.randint is still quite fast, taking roughly 5 s to generate a 458 MB array:
db = np.array(np.random.randint(2, size=(int(400e3), 1200)), dtype=bool)
print db.nbytes/1024./1024., 'Mb'
But if I double the size of the array to (1200, 800e3), I hit the swap and it takes ~2.7 min to create db ;(
cmd = """
import numpy as np
db = np.array(np.random.randint(2, size=(int(800e3), 1200)), dtype=bool)
print db.nbytes/1024./1024., 'Mb'"""
print timeit.Timer(cmd).timeit(1)
Using random.getrandbits takes even longer (~8 min), and also uses the swap:
from random import getrandbits
db = np.array([not getrandbits(1) for x in xrange(int(1200*800e3))], dtype=bool)
Using np.random.randint for a (1200, 2e6) array just gives a MemoryError.
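That is, the same call as above but with size=(int(2e6), 1200):

db = np.array(np.random.randint(2, size=(int(2e6), 1200)), dtype=bool)  # MemoryError on 8 GB of RAM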
Is there a more efficient way to create a (1200, 2e6) random boolean array?
One problem with using np.random.randint is that it generates 64-bit integers, whereas numpy's np.bool dtype uses only 8 bits to represent each boolean value. You are therefore allocating an intermediate array 8x larger than necessary.
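You can see the overhead with a quick sketch (the default integer dtype is platform-dependent, but it is 64-bit on most Linux and macOS builds, so the byte counts below assume that):

tmp = np.random.randint(2, size=(1000, 1000))  # intermediate array of (usually) 64-bit ints
tmp.nbytes                                     # 8000000 -> 8 bytes per element
tmp.astype(bool).nbytes                        # 1000000 -> 1 byte per element after conversion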
A workaround that avoids intermediate 64-bit dtypes is to generate a string of random bytes using np.random.bytes, which can be converted to an array of 8-bit integers using np.fromstring. These integers can then be converted to boolean values, for example by testing whether they are less than 255 * p, where p is the desired probability of each element being True:
import numpy as np

def random_bool(shape, p=0.5):
    n = np.prod(shape)
    x = np.fromstring(np.random.bytes(n), np.uint8, n)  # n random bytes viewed as uint8
    return (x < 255 * p).reshape(shape)                 # True with probability ~p
Benchmark:
In [1]: shape = 1200, int(2E6)
In [2]: %timeit random_bool(shape)
1 loops, best of 3: 12.7 s per loop
One important caveat is that the probability will be rounded down to the nearest multiple of 1/256 (for an exact multiple of 1/256 such as p=1/2 this should not affect accuracy).
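As a rough sanity check (the exact mean will vary a little from run to run):

x = random_bool((1000, 1000), p=0.2)
x.mean()   # ~0.2, up to the 1/256 quantisation described above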
An even faster method is to exploit the fact that you only need to generate a single random bit per 0 or 1 in your output array. You can therefore create a random array of 8-bit integers 1/8th the size of the final output, then convert it to np.bool using np.unpackbits:
def fast_random_bool(shape):
    n = np.prod(shape)
    nb = -(-n // 8)  # ceiling division
    b = np.fromstring(np.random.bytes(nb), np.uint8, nb)
    return np.unpackbits(b)[:n].reshape(shape).view(np.bool)
For example:
In [3]: %timeit fast_random_bool(shape)
1 loops, best of 3: 5.54 s per loop
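A quick usage sketch (note that fast_random_bool draws each output element directly from a random bit, so it only gives p = 0.5; use random_bool above if you need a different probability):

mask = fast_random_bool((1200, int(2e6)))
mask.dtype                    # dtype('bool')
mask.nbytes / 1024. / 1024.   # ~2288.8 Mb, the same footprint as the np.ones array in the question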