I need to create a large numpy array containing random boolean values without hitting the swap. My laptop has 8 GB of RAM.
Creating a (1200, 2e6) array with np.ones takes less than 2 s and uses 2.29 GB of RAM:
>>> dd = np.ones((1200, int(2e6)), dtype=bool)
>>> dd.nbytes/1024./1024
2288.818359375
>>> dd.shape
(1200, 2000000)
For a relatively small (1200, 400e3), np.random.randint is still quite fast, taking roughly 5 s to generate a 458 MB array:
db = np.array(np.random.randint(2, size=(int(400e3), 1200)), dtype=bool)
print db.nbytes/1024./1024., 'Mb'
But if I double the size of the array to (1200, 800e3), I hit the swap and it takes ~2.7 min to create db ;(
cmd = """
import numpy as np
db = np.array(np.random.randint(2, size=(int(800e3), 1200)), dtype=bool)
print db.nbytes/1024./1024., 'Mb'"""
print timeit.Timer(cmd).timeit(1)
Using random.getrandbits takes even longer (~8 min), and also uses the swap:
from random import getrandbits
db = np.array([not getrandbits(1) for x in xrange(int(1200*800e3))], dtype=bool)
Using np.random.randint for a (1200, 2e6) array just gives a MemoryError.
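That is, the same call as above but with size=(int(2e6), 1200):

db = np.array(np.random.randint(2, size=(int(2e6), 1200)), dtype=bool)  # MemoryError on 8 GB of RAM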
Is there a more efficient way to create a (1200, 2e6) random boolean array?
One problem with using np.random.randint is that it generates 64-bit integers, whereas numpy's np.bool dtype uses only 8 bits to represent each boolean value. You are therefore allocating an intermediate array 8x larger than necessary.
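You can see the overhead with a quick sketch (the default integer dtype is platform-dependent, but it is 64-bit on most Linux and macOS builds, so the byte counts below assume that):

tmp = np.random.randint(2, size=(1000, 1000))  # intermediate array of (usually) 64-bit ints
tmp.nbytes                                     # 8000000 -> 8 bytes per element
tmp.astype(bool).nbytes                        # 1000000 -> 1 byte per element after conversion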
A workaround that avoids intermediate 64-bit dtypes is to generate a string of random bytes using np.random.bytes, which can be converted to an array of 8-bit integers using np.fromstring. These integers can then be converted to boolean values, for example by testing whether they are less than 255 * p, where p is the desired probability of each element being True:
import numpy as np

def random_bool(shape, p=0.5):
    n = np.prod(shape)
    x = np.fromstring(np.random.bytes(n), np.uint8, n)  # n random bytes viewed as uint8
    return (x < 255 * p).reshape(shape)                 # True with probability ~p
Benchmark:
In [1]: shape = 1200, int(2E6)
In [2]: %timeit random_bool(shape)
1 loops, best of 3: 12.7 s per loop
One important caveat is that the probability will be rounded down to the nearest multiple of 1/256 (for an exact multiple of 1/256 such as p=1/2 this should not affect accuracy).
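As a rough sanity check (the exact mean will vary a little from run to run):

x = random_bool((1000, 1000), p=0.2)
x.mean()   # ~0.2, up to the 1/256 quantisation described above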
An even faster method is to exploit the fact that you only need to generate a single random bit per 0 or 1 in your output array. You can therefore create a random array of 8-bit integers 1/8th the size of the final output, then convert it to np.bool using np.unpackbits:
def fast_random_bool(shape):
    n = np.prod(shape)
    nb = -(-n // 8)  # ceiling division
    b = np.fromstring(np.random.bytes(nb), np.uint8, nb)
    return np.unpackbits(b)[:n].reshape(shape).view(np.bool)
For example:
In [3]: %timeit fast_random_bool(shape)
1 loops, best of 3: 5.54 s per loop
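A quick usage sketch (note that fast_random_bool draws each output element directly from a random bit, so it only gives p = 0.5; use random_bool above if you need a different probability):

mask = fast_random_bool((1200, int(2e6)))
mask.dtype                    # dtype('bool')
mask.nbytes / 1024. / 1024.   # ~2288.8 Mb, the same footprint as the np.ones array in the question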