 

An efficient way of making a large random bytearray

I need to create a large bytearray of a specific size, but the size is not known prior to run time. The bytes need to be fairly random. The bytearray size may be as small as a few KB but as large as several MB. I do not want to iterate byte-by-byte: that is too slow -- I need performance similar to numpy.random. However, I do not have the numpy module available for this project. Is there something in a standard Python install that will do this? Or do I need to compile my own using C?

For those asking for timings:

>>> timeit.timeit('[random.randint(0,128) for i in xrange(1,100000)]', setup='import random', number=100)
35.73110193696641
>>> timeit.timeit('numpy.random.random_integers(0,128,100000)', setup='import numpy', number=100)
0.5785652013481126
Paul asked Aug 12 '11




2 Answers

The os module provides urandom, even on Windows:

bytearray(os.urandom(1000000)) 

This seems to perform as quickly as you need; in fact, I get better timings than your numpy run (though our machines could be wildly different):

>>> timeit.timeit(lambda: bytearray(os.urandom(1000000)), number=10)
0.0554857286941
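On Python 3 the same approach applies unchanged; a minimal sketch (Python 3 syntax, the 1 MB size is just an example) that you can time yourself:

```python
import os
import timeit

# build a 1 MB mutable buffer from the OS entropy source
buf = bytearray(os.urandom(1000000))

# time 10 repetitions of the same construction
t = timeit.timeit(lambda: bytearray(os.urandom(1000000)), number=10)
print("%d bytes, 10 runs in %.4f s" % (len(buf), t))
```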
Ned Batchelder answered Oct 02 '22


There are several possibilities, some faster than os.urandom. Also consider whether the data has to be generated deterministically from a random seed. This is invaluable for unit tests where failures have to be reproducible.
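For the reproducible-unit-test case, one option is to seed a private random.Random instance so the same seed always yields the same bytes -- a minimal sketch (Python 3 syntax; the seed value is arbitrary):

```python
import random

def seeded_bytes(seed, n):
    # a private, seeded generator: same seed -> same byte stream
    rng = random.Random(seed)
    return bytearray(rng.getrandbits(8) for _ in range(n))

a = seeded_bytes(42, 16)
b = seeded_bytes(42, 16)
assert a == b  # reproducible across runs
```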

short and pithy:

lambda n:bytearray(map(random.getrandbits,(8,)*n))

I've used the above for unit tests and it was fast enough, but can it be done faster?

using itertools:

lambda n:bytearray(itertools.imap(random.getrandbits,itertools.repeat(8,n)))

itertools and struct, producing 8 bytes per iteration:

lambda n:(b''.join(map(struct.Struct("!Q").pack,
    itertools.imap(random.getrandbits,
                   itertools.repeat(64,(n+7)//8)))))[:n]

Anything based on b''.join will fill 3-7x the memory consumed by the final bytearray with temporary objects, since it queues up all the sub-strings before joining them together, and Python objects have a lot of storage overhead.

Producing large chunks with a specialized function gives better performance and avoids filling memory.

import random,itertools,struct,operator

def randbytes(n,_struct8k=struct.Struct("!1000Q").pack_into):
    if n<8000:
        longs=(n+7)//8
        return struct.pack("!%iQ"%longs,*map(
            random.getrandbits,itertools.repeat(64,longs)))[:n]
    data=bytearray(n)
    for offset in xrange(0,n-7999,8000):
        _struct8k(data,offset,
            *map(random.getrandbits,itertools.repeat(64,1000)))
    offset+=8000
    data[offset:]=randbytes(n-offset)
    return data
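The function above is written for Python 2 (xrange). A hedged Python 3 port -- same chunking idea, only the loop construct adapted, and the function name changed to mark it as mine:

```python
import random, itertools, struct

def randbytes_py3(n, _struct8k=struct.Struct("!1000Q").pack_into):
    # Python 3 port of randbytes above: xrange -> range, otherwise unchanged
    if n < 8000:
        longs = (n + 7) // 8
        return struct.pack("!%iQ" % longs, *map(
            random.getrandbits, itertools.repeat(64, longs)))[:n]
    data = bytearray(n)
    for offset in range(0, n - 7999, 8000):
        _struct8k(data, offset,
                  *map(random.getrandbits, itertools.repeat(64, 1000)))
    offset += 8000
    data[offset:] = randbytes_py3(n - offset)
    return data
```

Note that Python 3.9+ also ships random.randbytes(n), which covers the common (non-chunked) case directly.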

Performance

  • 0.84 MB/s: original solution with randint
  • 4.8 MB/s: bytearray(getrandbits(8) for _ in xrange(n)) (solution by other poster)
  • 6.4 MB/s: bytearray(map(getrandbits,(8,)*n))
  • 7.2 MB/s: itertools and getrandbits
  • 10 MB/s: os.urandom
  • 23 MB/s: itertools and struct
  • 35 MB/s: optimised function (holds for len = 1KB ... 100MB)

Note: all tests used 10KB as the string size. Results were consistent until intermediate results filled memory.

Note: os.urandom is meant to provide secure random seeds; applications then expand that seed with their own fast PRNG. Here's an example using AES in counter mode as a PRNG:

import os
seed = os.urandom(32)

from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.backends import default_backend
backend = default_backend()
cipher = Cipher(algorithms.AES(seed), modes.CTR(b'\0'*16), backend=backend)
encryptor = cipher.encryptor()

nulls = b'\0'*(10**5)  # 100KB
from timeit import timeit
t = timeit(lambda: encryptor.update(nulls), number=10**4)  # 1GB total (100KB * 10k)
print("%.1f MB/s" % (1000/t))

This produces pseudorandom data at 180 MB/s (no hardware AES acceleration, single core). That's only ~5x the speed of the pure Python code above.

Addendum

There's a pure Python crypto library waiting to be written. Putting the above techniques together with hashlib and stream-cipher techniques looks promising. Here's a teaser: a fast string xor (42 MB/s).

def xor(a,b):
    s="!%iQ%iB"%divmod(len(a),8)
    return struct.pack(s,*itertools.imap(operator.xor,
        struct.unpack(s,a),
        struct.unpack(s,b)))
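To illustrate the hashlib angle, here is a hypothetical sketch (Python 3; the function name is my own, not a real library API) of a counter-mode SHA-256 keystream that could feed an xor like the one above:

```python
import hashlib
import struct

def sha256_keystream(seed, n):
    # hash (seed || counter) repeatedly; deterministic for a given seed
    blocks = []
    for ctr in range((n + 31) // 32):  # each SHA-256 digest is 32 bytes
        blocks.append(hashlib.sha256(seed + struct.pack("!Q", ctr)).digest())
    return b''.join(blocks)[:n]

ks = sha256_keystream(b'secret seed', 100)
```

Being hash-based rather than a vetted stream cipher, this is a proof of concept for speed experiments, not something to rely on for security.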
Richard Thiessen answered Oct 02 '22