Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimise filtering lists in Python 2.7

I need to filter a large lists several times, but I'm concerned with both simplicity of code and execution efficiency. To give an example:

all_things # huge collection of all things

# inefficient but clean code
def get_clothes():
    return filter(lambda t: t.garment, allThings)

def get_hats():
    return filter(lambda t: t.headgear, get_clothes())

I'm concerned that I'm iterating over the clothes list when in fact it has already been iterated over. I also want to keep the two filter operations separate, as they belong to two different classes, and I do not want to duplicate the first lambda function in the hats class.

# efficient but duplication of code
def get_clothes():
    return filter(lambda t: t.garment, allThings)

def get_hats():
    return filter(lambda t: t.headgear and t.garment, allThings)

I have been investigating generator functions, as they seemed like the way to go, but I haven't as yet figure out how.

like image 305
cammil Avatar asked Apr 27 '12 11:04

cammil


2 Answers

First of all using filter/lambda combination is going to be deprecated. Current functional programming style is described in Python Functional Programming HOWTO.

Secondly, if you concerned with efficiency, rather than construct lists, you should return generators. In this case they are simple enough to use generator expressions.

def get_clothes():
    return (t for t in allThings if t.garment)

def get_hats():
    return (t for t in get_clothes() if t.headgear)

Or if you'd prefer, true generators (allegedly more pythonic):

def get_clothes():
    for t in allThings:
       if t.garment:
           yield t

def get_hats():
    for t in get_clothes():
        if t.headgear:
            yield t

If for some reason, sometimes you need list rather than iterator, you can construct list by simple casting:

hats_list = list(get_hats())

Note, that above will not construct list of clothes, thus efficiency is close to your duplicate code version.

like image 194
vartec Avatar answered Oct 20 '22 20:10

vartec


I was looking for similar filtering of lists but wanted to have a slightly different format to what was presented here.

The get_hats() call above is good but limited in its reuse. I was looking for something more like get_hats(get_clothes(all_things)), where you can specify a source (all_things), and then as few or as many levels of filters get_hats(), get_clothes() as you want.

I found a way to do that with generators:

def get_clothes(in_list):
    for item in in_list:
        if item.garment:
            yield item

def get_hats(in_list):
    for item in in_list:
        if item.headgear:
            yield item

This can then be called by:

get_hats(get_clothes(all_things))

I tested the original solutions, vartec's solution and this additional solution to see the efficiency, and was somewhat surprised by the results. Code as follows:

Setup:

class Thing:
    def __init__(self):
        self.garment = False
        self.headgear = False

all_things = [Thing() for i in range(1000000)]

for i, thing in enumerate(all_things):
    if i % 2 == 0:
        thing.garment = True
    if i % 4 == 0:
        thing.headgear = True

Original solutions:

def get_clothes():
    return filter(lambda t: t.garment, all_things)

def get_hats():
    return filter(lambda t: t.headgear, get_clothes())

def get_clothes2():
    return filter(lambda t: t.garment, all_things)

def get_hats2():
    return filter(lambda t: t.headgear and t.garment, all_things)

My solution:

def get_clothes3(in_list):
    for item in in_list:
        if item.garment:
            yield item

def get_hats3(in_list):
    for item in in_list:
        if item.headgear:
            yield item

vartec's solution:

def get_clothes4():
    for t in all_things:
       if t.garment:
           yield t

def get_hats4():
    for t in get_clothes4():
        if t.headgear:
            yield t

Timing code:

import timeit

print 'get_hats()'
print timeit.timeit('get_hats()', 'from __main__ import get_hats', number=1000)

print 'get_hats2()'
print timeit.timeit('get_hats2()', 'from __main__ import get_hats2', number=1000)

print '[x for x in get_hats3(get_clothes3(all_things))]'
print timeit.timeit('[x for x in get_hats3(get_clothes3(all_things))]',
                    'from __main__ import get_hats3, get_clothes3, all_things',
                    number=1000)

print '[x for x in get_hats4()]'
print timeit.timeit('[x for x in get_hats4()]',
                    'from __main__ import get_hats4', number=1000)

Results:

get_hats()
379.334653854
get_hats2()
232.768362999
[x for x in get_hats3(get_clothes3(all_things))]
214.376812935
[x for x in get_hats4()]
218.250688076

The generator expressions appear to be slightly faster, the difference in time between my and vartec's solutions are probably just noise. But I prefer the flexibility of being able to apply whatever filters are required in whatever order.

like image 39
Allan5 Avatar answered Oct 20 '22 20:10

Allan5