
How to speed up numpy array-filling in python?

I'm trying to fill a preallocated NumPy array using the following code:

import numpy
import random

# preallocate a block array; 'u8' = 8-byte (64-bit) unsigned integers
dt = numpy.dtype('u8')
in_memory_blocks = numpy.zeros(_AVAIL_IN_MEMORY_BLOCKS, dt)

...

# write all the blocks out, flushing only as desired
blocks_per_flush_xrange = xrange(0, blocks_per_flush)
for _ in xrange(0, num_flushes):
    # fill the buffer one element at a time, one Python-level call per element
    for block_index in blocks_per_flush_xrange:
        in_memory_blocks[block_index] = random.randint(0, _BLOCK_MAX)

    print('flushing bytes stored in memory...')

    # commented out for SO; exists in actual code
    # removing this doesn't make an order-of-magnitude difference in time
    # m.update(in_memory_blocks[:blocks_per_flush])

    in_memory_blocks[:blocks_per_flush].tofile(f)

Some points:

  • num_flushes is low, at around 4 - 10
  • blocks_per_flush is a large number, on the order of millions
  • in_memory_blocks can be a fairly large buffer (I've set it as low as 1MB and as high as 100MB) but the timing is very consistent...
  • _BLOCK_MAX is the max for an 8-byte unsigned int
  • m is a hashlib.md5()

Generating 1MB using the above code takes ~1s; 500MB takes ~376s. By comparison, my simple C program that uses rand() can create a 500MB file in 8s.

How can I improve the performance in the above loop? I'm pretty sure I'm ignoring something obvious that's causing this massive difference in runtime.

Asked by Allen George on Apr 15 '11



2 Answers

Due to the fact that 0.._BLOCK_MAX covers all the possible values for numpy.uint8 (I assume that numpy.dtype('u8'), i.e. numpy.uint64, is a typo for numpy.uint8), you could use:

import numpy as np

for _ in xrange(0, num_flushes):
    # reinterpret a block of random bytes as uint8 values (no copy)
    in_memory_blocks = np.frombuffer(np.random.bytes(blocks_per_flush),
                                     dtype=np.uint8)

    print('flushing bytes stored in memory...')
    # ...

This variant is ~8 times faster than @hgomersall's:

$ python -mtimeit -s'import numpy as np' '
>     np.uint8(np.random.randint(0,256,20000000))'
10 loops, best of 3: 316 msec per loop

$ python -mtimeit -s'import numpy as np' '
>     np.frombuffer(np.random.bytes(20000000), dtype=np.uint8)'
10 loops, best of 3: 38.6 msec per loop

If numpy.dtype('u8') is not a typo and you indeed require numpy.uint64 then:

a = np.int64(np.random.random_integers(0, _BLOCK_MAX, blocks_per_flush))
in_memory_blocks = a.view(np.uint64) # unsigned

Note: np.int64() doesn't make a copy if the array's dtype is already np.int64. .view(numpy.uint64) forces its interpretation as unsigned (also no copy is performed).
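
For illustration, here's a minimal sketch (with a small placeholder size; the real code uses millions of elements) showing that the .view() reinterpretation shares the underlying buffer rather than copying:

import numpy as np

a = np.int64(np.random.randint(0, 2**8, 10))
b = a.view(np.uint64)   # same buffer, reinterpreted as unsigned

assert b.base is a      # b is a view onto a's memory, not a copy
a[0] = -1               # -1 as int64 is all ones in binary...
print(b[0])             # ...so the uint64 view reads it as 2**64 - 1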

Answered by jfs


Since you're allocating contiguous blocks, you should be able to do the following (getting rid of the inner loop entirely):

for _ in xrange(0, num_flushes):
    in_memory_blocks[:blocks_per_flush] = numpy.random.randint(
            0, _BLOCK_MAX+1, blocks_per_flush)

    print('flushing bytes stored in memory...')

    # commented out for SO; exists in actual code
    # removing this doesn't make an order-of-magnitude difference in time
    # m.update(in_memory_blocks[:blocks_per_flush])

    in_memory_blocks[:blocks_per_flush].tofile(f)

This uses the numpy.random.randint function, which allocates a whole block of memory and fills it with random integers (note J.F. Sebastian's comment about numpy.random.randint versus random.randint). As far as I can see, there's no way to fill a preallocated array using numpy's random routines. The other problem is that numpy's randint returns int64 arrays, so if you need integers of some other size you can use the numpy typing methods, for example numpy.uint8. If you want the random integers to cover the whole range of the type, then @J.F. Sebastian's method above using numpy.random.bytes is going to be the best (in almost any case!).
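
As an aside beyond the era of this answer: newer NumPy versions (1.11+) let numpy.random.randint generate the target dtype directly, avoiding both the int64 intermediate and the converting copy. A minimal sketch of the two approaches:

import numpy

n = 20000000

# original approach: int64 intermediate, then a converting (copying) cast
a = numpy.uint8(numpy.random.randint(0, 256, n))

# newer NumPy (>= 1.11): generate uint8 values directly
b = numpy.random.randint(0, 256, n, dtype=numpy.uint8)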

However, simple tests show reasonable times (of the same order of magnitude as the C code). The following code tests the time to allocate uint8 arrays of 20,000,000 random integers using the numpy method:

from timeit import Timer
t = Timer(stmt='a=numpy.uint8(numpy.random.randint(0, 100, 20000000))',
        setup='import numpy')
test_runs = 50
time = t.timeit(test_runs)/test_runs  # average seconds per allocation
print(time)

I get about 0.7 seconds per allocation of 20,000,000 random uint8 integers on my 4-year-old Core2 laptop (the test runs 50 times, so the whole thing takes longer). At roughly 20 MB per allocation, that's 25 allocations for 500 MB, so I'd expect something around 20 s for the whole lot.

More memory would mean you could allocate bigger chunks at once, but you're still effectively wasting time allocating and writing 64 bits for each int when you only need 8 (I haven't quantified this effect). If it's still not fast enough, you could call your C implementation through numpy's ctypes interface. This is really rather easy to use, and you'd get virtually no slowdown over pure C.
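
A minimal sketch of that ctypes route (the library name libfill and the function fill_random are hypothetical; the C side would need a matching signature, e.g. void fill_random(uint64_t *buf, size_t n)):

import ctypes
import numpy
import numpy.ctypeslib as npct

# load a hypothetical shared library built from your C code
lib = npct.load_library('libfill', '.')

# declare the argument types: a contiguous uint64 buffer and its length
array_1d_u64 = npct.ndpointer(dtype=numpy.uint64, ndim=1,
                              flags='C_CONTIGUOUS')
lib.fill_random.argtypes = [array_1d_u64, ctypes.c_size_t]
lib.fill_random.restype = None

buf = numpy.zeros(20000000, dtype=numpy.uint64)
lib.fill_random(buf, buf.size)  # C fills the preallocated buffer in place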

The general take-home message is that with numpy, you should always use the numpy routines where they exist, remembering that falling back to C with ctypes is not too painful. In general, this methodology allows quite effective use of Python for numerical processing with very little slowdown.

Edit: Something else that just occurred to me: as it's implemented above, I think you'd be making an additional unnecessary copy. If in_memory_blocks is of length blocks_per_flush, then you'd do better just to assign it the return value of numpy.random.randint, rather than assigning into a slice of it (which in the general case must be a copy). So:

in_memory_blocks = numpy.random.randint(0, _BLOCK_MAX+1, blocks_per_flush)

rather than:

in_memory_blocks[:blocks_per_flush] = numpy.random.randint(
        0, _BLOCK_MAX+1, blocks_per_flush)

However, having timed this, the first case doesn't lead to a significant increase in speed (only about 2%), so it's probably not worth worrying about too much. I'd guess the overwhelming amount of time is spent actually generating random numbers (which is what I would expect).
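
For reference, a minimal sketch of how the two variants could be timed against each other (the 20,000,000-element size mirrors the test above; the ~2% figure quoted comes from the original timing, not this sketch):

from timeit import Timer

setup = ('import numpy\n'
         'n = 20000000\n'
         'buf = numpy.zeros(n, numpy.int64)')

# direct assignment: rebind the name to the freshly allocated result
t_direct = Timer('a = numpy.random.randint(0, 256, n)', setup)

# slice assignment: copy the result into the preallocated buffer
t_slice = Timer('buf[:n] = numpy.random.randint(0, 256, n)', setup)

runs = 10
print(t_direct.timeit(runs) / runs)  # seconds per run, direct
print(t_slice.timeit(runs) / runs)   # seconds per run, via slice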

Answered by Henry Gomersall