The speed bottleneck in my code is a tight double for loop over the elements of two arrays, x and y. A standard HPC trick to improve performance is to process the loop in chunks so that cache misses are minimized. I am trying to use Python generators to do the chunking, but needing to continually recreate the spent generator within the outer for loop is killing my runtime.
Question:
Is there a more intelligent algorithm for constructing the appropriate generator for performing chunked double-for loops?
Concrete illustration:
I'll create two dummy arrays, x and y. I'll keep them short for illustration, but in practice these are numpy arrays with ~1e6 elements.
x = np.array(['a', 'b', 'b', 'c', 'c', 'd'])
y = np.array(['e', 'f', 'f', 'g'])
The naive double for loop would just be:
for xletter in x:
    for yletter in y:
        pass  # algebraic manipulations on x & y
Now let's use generators to do this loop in chunks:
chunk_size = 3
xchunk_gen = (x[i: i+chunk_size] for i in range(0, len(x), chunk_size))
for xchunk in xchunk_gen:
    ychunk_gen = (y[i: i+chunk_size] for i in range(0, len(y), chunk_size))
    for ychunk in ychunk_gen:
        for xletter in xchunk:
            for yletter in ychunk:
                pass  # algebraic manipulations on x & y
Note that in order to implement a generator solution to this problem, I have to continually recreate ychunk_gen within the outer loop. Since y is a large array, this is killing my runtime (for ~1e6 elements, creating this generator takes ~20 ms on my laptop).
Is there a cleverer way to construct my generators that gets around this problem? Or will it be necessary to abandon the generator solution altogether?
(Note: in practice I am using Cython to perform this tight loop, but all of the above applies regardless.)
In my opinion, the creation of your generator expression is killing your running time because it is not optimized by Cython.
A better solution, which takes care of all the cache-optimization details, is to use numexpr. Since the manipulations of x and y are algebraic, it should fit your constraints very well (numexpr can do a little more).
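As a rough sketch of what that could look like (the arrays and the expression here are hypothetical stand-ins, since the actual manipulation isn't shown; numexpr works on numeric arrays and supports broadcasting, so the double loop over all (x, y) pairs can become one 2-D expression):
import numpy as np
import numexpr as ne

# Small numeric stand-ins; a full 1e6 x 1e6 pairwise result would not
# fit in memory, so real code would still evaluate it in row blocks.
x = np.arange(6, dtype=np.float64)
y = np.arange(4, dtype=np.float64)

# Broadcast to 2-D so the expression covers every (xletter, yletter) pair.
x2 = x[:, np.newaxis]  # shape (6, 1)
y2 = y[np.newaxis, :]  # shape (1, 4)

# numexpr compiles the expression once and evaluates it blockwise over
# cache-sized chunks, in parallel threads, so no manual chunking is needed.
result = ne.evaluate("x2 * y2 + 2 * x2")  # placeholder algebraic expression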
You are defining ychunk_gen all over again within the xchunk loop. Maybe the following will be faster:
chunk_size = 3
xchunk_gen = (x[i: i+chunk_size] for i in range(0, len(x), chunk_size))

def ychunk_gen(some_dependency_on_outer_loop):
    # use some_dependency_on_outer_loop
    for i in range(0, len(y), chunk_size):
        yield y[i: i+chunk_size]

for xchunk in xchunk_gen:
    for ychunk in ychunk_gen(chunk_or_something_else):
        for xletter in xchunk:
            for yletter in ychunk:
                pass  # algebraic manipulations on x & y
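If the y-chunks do not actually depend on the outer loop, a simpler variant (my sketch, not part of the suggestion above) is to materialize the chunks once and reuse the list on every outer iteration:
ychunks = [y[i: i+chunk_size] for i in range(0, len(y), chunk_size)]

for xchunk in xchunk_gen:
    for ychunk in ychunks:  # reuse the precomputed chunk list
        for xletter in xchunk:
            for yletter in ychunk:
                pass  # algebraic manipulations on x & y
This pays the chunking cost once instead of once per outer iteration, and since numpy slices are views, the list adds almost no memory overhead.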
But perhaps there is an even better way: I assume x and y are numpy arrays, so you can reshape the arrays and then loop over the rows:
# Note: this assumes len(x) and len(y) are exact multiples of chunk_size
# (the example y above, with 4 elements and chunk_size = 3, would need
# padding or a separate remainder loop).
for xchunk in x.reshape((len(x)//chunk_size, chunk_size)):
    for ychunk in y.reshape((len(y)//chunk_size, chunk_size)):
        pass  # the letter loops
In http://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html I read that if you want the data not to be copied by reshape, you should change the shape property of the array instead:
x.shape = len(x)//chunk_size, chunk_size
y.shape = len(y)//chunk_size, chunk_size
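A quick way to see the difference (a small sketch, assuming a contiguous 1-D array): reshape returns a view when it can, while assigning to shape changes the array in place and raises an error if a copy would be required.
import numpy as np

a = np.arange(6)
b = a.reshape((2, 3))
print(b.base is a)  # True: b is a view, no data was copied

a.shape = (2, 3)    # in-place shape change; never copies the data
                    # (raises AttributeError if a copy would be needed)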