Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to execute a for loop in batches?

for x in records:
   data = {}
   for y in sObjectName.describe()['fields']
         data[y['name']] = x[y['name']]
   ls.append(adapter.insert_posts(collection, data))

I want to execute the code ls.append(adapter.insert_post(collection, x)) in the batch size of 500, where x should contain 500 data dicts. I could create a list a of 500 data dicts using a double for loop and a list and then insert it. I could do that in the following way, , is there a better way to do it? :

for x in records:
    for i in xrange(0,len(records)/500):
        for j in xrange(0,500):
            l=[]
            data = {}
            for y in sObjectName.describe()['fields']:
                data[y['name']] = x[y['name']]
                #print data
            #print data
            l.append(data)
        ls.append(adapter.insert_posts(collection, data))

    for i in xrange(0,len(records)%500):
        l=[]
        data = {}
        for y in sObjectName.describe()['fields']:
            data[y['name']] = x[y['name']]
            #print data
        #print data
        l.append(data)
    ls.append(adapter.insert_posts(collection, data))
like image 353
Mudits Avatar asked May 29 '14 05:05

Mudits


3 Answers

The general structure I use looks like this:

worklist = [...]
batchsize = 500

for i in range(0, len(worklist), batchsize):
    batch = worklist[i:i+batchsize] # the result might be shorter than batchsize at the end
    # do stuff with batch

Note that we're using the step argument of range to simplify the batch processing considerably.

like image 120
nneonneo Avatar answered Oct 15 '22 23:10

nneonneo


If you're working with sequences, the solution by @nneonneo is about as performant as you can get. If you want a solution which works with arbitrary iterables, you can look into some of the itertools recipes. e.g. grouper:

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return itertools.izip_longest(fillvalue=fillvalue, *args)

I tend to not use this one because it "fills" the last group with None so that it is the same length as the others. I usually define my own variant which doesn't have this behavior:

def grouper2(iterable, n):
    iterable = iter(iterable)
    while True:
        tup = tuple(itertools.islice(iterable, 0, n))
        if tup:
            yield tup
        else:
            break

This yields tuples of the requested size. This is generally good enough, but, for a little fun we can write a generator which returns lazy iterables of the correct size if we really want to...

The "best" solution here I think depends a bit on the problem at hand -- particularly the size of the groups and objects in the original iterable and the type of the original iterable. Generally, these last 2 recipes will find less use because they're more complex and rarely needed. However, If you're feeling adventurous and in the mood for a little fun, read on!


The only real modification that we need to get a lazy iterable instead of a tuple is the ability to "peek" at the next value in the islice to see if there is anything there. here I just peek at the value -- If it's missing, StopIteration will be raised which will stop the generator just as if it had ended normally. If it's there, I put it back using itertools.chain:

def grouper3(iterable, n):
    iterable = iter(iterable)
    while True:
        group = itertools.islice(iterable, n)
        item = next(group)  # raises StopIteration if the group doesn't yield anything
        yield itertools.chain((item,), group)

Careful though, this last function only "works" if you completely exhaust each iterable yielded before moving on to the next one. In the extreme case where you don't exhaust any of the iterables, e.g. list(grouper3(..., n)), you'll get "m" iterables which yield only 1 item, not n (where "m" is the "length" of the input iterable). This behavior could actually be useful sometimes, but not typically. We can fix that too if we use the itertools "consume" recipe (which also requires importing collections in addition to itertools):

def grouper4(iterable, n):
    iterable = iter(iterable)
    group = []
    while True:
        collections.deque(group, maxlen=0)  # consume all of the last group
        group = itertools.islice(iterable, n)
        item = next(group)  # raises StopIteration if the group doesn't yield anything
        group = itertools.chain((item,), group)
        yield group

Of course, list(grouper4(..., n)) will return empty iterables -- Any value not pulled from the "group" before the next invocation of next (e.g. when the for loop cycles back to the start) will never get yielded.

like image 26
mgilson Avatar answered Oct 15 '22 23:10

mgilson


I like @nneonneo and @mgilson's answers but doing this over and over again is tedious. The bottom of the itertools page in python3 mentions the library more-itertools (I know this question was about python2 and this is python3 library, but some might find this useful). The following seems to do what you ask:

from more_itertools import chunked # Note: you might also want to look at ichuncked

for batch in chunked(records, 500):
    # Do the work--`batch` is a list of 500 records (or less for the last batch).  
like image 20
Tim Ludwinski Avatar answered Oct 15 '22 23:10

Tim Ludwinski