 

An efficient way of making a large random bytearray

I need to create a large bytearray of a specific size, but the size is not known prior to run time. The bytes need to be fairly random. The bytearray size may be as small as a few KB but as large as several MB. I do not want to iterate byte-by-byte: that is too slow -- I need performance similar to numpy.random. However, I do not have the numpy module available for this project. Is there something in a standard Python install that will do this? Or do I need to compile my own using C?

For those asking for timings:

>>> timeit.timeit('[random.randint(0,128) for i in xrange(1,100000)]', setup='import random', number=100)
35.73110193696641
>>> timeit.timeit('numpy.random.random_integers(0,128,100000)', setup='import numpy', number=100)
0.5785652013481126
Paul asked Aug 12 '11




2 Answers

The os module provides urandom, even on Windows:

bytearray(os.urandom(1000000)) 

This seems to perform as quickly as you need; in fact, I get better timings than your numpy run (though our machines could be wildly different):

>>> timeit.timeit(lambda: bytearray(os.urandom(1000000)), number=10)
0.0554857286941
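On Python 3 the same approach applies unchanged; a minimal sketch (Python 3 syntax, the 1 MB size is just an example) that you can time yourself:

```python
import os
import timeit

# build a 1 MB mutable buffer from the OS entropy source
buf = bytearray(os.urandom(1000000))

# time 10 repetitions of the same construction
t = timeit.timeit(lambda: bytearray(os.urandom(1000000)), number=10)
print("%d bytes, 10 runs in %.4f s" % (len(buf), t))
```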
Ned Batchelder answered Oct 02 '22


There are several possibilities, some faster than os.urandom. Also consider whether the data has to be generated deterministically from a random seed. This is invaluable for unit tests where failures have to be reproducible.
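For the reproducible-unit-test case, one option is to seed a private random.Random instance so the same seed always yields the same bytes -- a minimal sketch (Python 3 syntax; the seed value is arbitrary):

```python
import random

def seeded_bytes(seed, n):
    # a private, seeded generator: same seed -> same byte stream
    rng = random.Random(seed)
    return bytearray(rng.getrandbits(8) for _ in range(n))

a = seeded_bytes(42, 16)
b = seeded_bytes(42, 16)
assert a == b  # reproducible across runs
```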

short and pithy:

lambda n:bytearray(map(random.getrandbits,(8,)*n))

I've used the above for unit tests and it was fast enough, but can it be done faster?

using itertools:

lambda n:bytearray(itertools.imap(random.getrandbits,itertools.repeat(8,n)))

itertools and struct, producing 8 bytes per iteration:

lambda n:(b''.join(map(struct.Struct("!Q").pack,
    itertools.imap(random.getrandbits,
                   itertools.repeat(64,(n+7)//8)))))[:n]

Anything based on b''.join will fill 3-7x the memory consumed by the final bytearray with temporary objects, since it queues up all the sub-strings before joining them together, and Python objects have a lot of storage overhead.

Producing large chunks with a specialized function gives better performance and avoids filling memory.

import random,itertools,struct,operator

def randbytes(n,_struct8k=struct.Struct("!1000Q").pack_into):
    if n<8000:
        longs=(n+7)//8
        return struct.pack("!%iQ"%longs,*map(
            random.getrandbits,itertools.repeat(64,longs)))[:n]
    data=bytearray(n)
    for offset in xrange(0,n-7999,8000):
        _struct8k(data,offset,
            *map(random.getrandbits,itertools.repeat(64,1000)))
    offset+=8000
    data[offset:]=randbytes(n-offset)
    return data
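The function above is written for Python 2 (xrange). A hedged Python 3 port -- same chunking idea, only the loop construct adapted, and the function name changed to mark it as mine:

```python
import random, itertools, struct

def randbytes_py3(n, _struct8k=struct.Struct("!1000Q").pack_into):
    # Python 3 port of randbytes above: xrange -> range, otherwise unchanged
    if n < 8000:
        longs = (n + 7) // 8
        return struct.pack("!%iQ" % longs, *map(
            random.getrandbits, itertools.repeat(64, longs)))[:n]
    data = bytearray(n)
    for offset in range(0, n - 7999, 8000):
        _struct8k(data, offset,
                  *map(random.getrandbits, itertools.repeat(64, 1000)))
    offset += 8000
    data[offset:] = randbytes_py3(n - offset)
    return data
```

Note that Python 3.9+ also ships random.randbytes(n), which covers the common (non-chunked) case directly.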

Performance

  • 0.84 MB/s: original solution with randint
  • 4.8 MB/s: bytearray(getrandbits(8) for _ in xrange(n)) (solution by other poster)
  • 6.4 MB/s: bytearray(map(getrandbits,(8,)*n))
  • 7.2 MB/s: itertools and getrandbits
  • 10 MB/s: os.urandom
  • 23 MB/s: itertools and struct
  • 35 MB/s: optimised function (holds for len = 1KB ... 100MB)

Note: all tests used 10KB as the string size. Results were consistent until intermediate results filled memory.

Note: os.urandom is meant to provide secure random seeds; applications then expand that seed with their own fast PRNG. Here's an example using AES in counter mode as a PRNG:

import os
seed = os.urandom(32)

from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.backends import default_backend
backend = default_backend()
cipher = Cipher(algorithms.AES(seed), modes.CTR(b'\0'*16), backend=backend)
encryptor = cipher.encryptor()

nulls = b'\0'*(10**5)  # 100KB
from timeit import timeit
t = timeit(lambda: encryptor.update(nulls), number=10**4)  # 1GB total (100KB * 10k)
print("%.1f MB/s" % (1000/t))

This produces pseudorandom data at 180 MB/s (no hardware AES acceleration, single core). That's only ~5x the speed of the pure Python code above.

Addendum

There's a pure Python crypto library waiting to be written. Putting the above techniques together with hashlib and stream-cipher techniques looks promising. Here's a teaser: a fast string xor (42 MB/s).

def xor(a,b):
    s="!%iQ%iB"%divmod(len(a),8)
    return struct.pack(s,*itertools.imap(operator.xor,
        struct.unpack(s,a),
        struct.unpack(s,b)))
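To illustrate the hashlib angle, here is a hypothetical sketch (Python 3; the function name is my own, not a real library API) of a counter-mode SHA-256 keystream that could feed an xor like the one above:

```python
import hashlib
import struct

def sha256_keystream(seed, n):
    # hash (seed || counter) repeatedly; deterministic for a given seed
    blocks = []
    for ctr in range((n + 31) // 32):  # each SHA-256 digest is 32 bytes
        blocks.append(hashlib.sha256(seed + struct.pack("!Q", ctr)).digest())
    return b''.join(blocks)[:n]

ks = sha256_keystream(b'secret seed', 100)
```

Being hash-based rather than a vetted stream cipher, this is a proof of concept for speed experiments, not something to rely on for security.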
Richard Thiessen answered Oct 02 '22