I have a dataset like this (simplified):
foods_dict = {}
foods_dict['fruit'] = ['apple', 'orange', 'plum']
foods_dict['veg'] = ['cabbage', 'potato', 'carrot']
And I have a list of items I want to categorize:
items = ['orange', 'potato', 'cabbage', 'plum', 'farmer', 'egg']
I want to be able to assign item from the items
list into smaller lists based on their occurrence in the foods_dict
. I think these sublists should actually be sets
as I don't want any duplicates in there.
My first pass at the code was like this:
fruits = set()
veggies = set()
others = set()
for item in items:
if item in foods_dict.get('fruit'):
fruits.add(item)
elif item in foods_dict.get('veg'):
veggies.add(item)
else:
others.add(item)
But this seems really inefficient and needlessly verbose to me. My question is, how could this code be improved? I'm guessing list comprehension could be useful here, but I'm not sure with the number of lists.
For an efficient solution you want to avoid explicit loops as much as possible:
items = set(items)
fruits = set(foods_dict['fruit']) & items
veggies = set(foods_dict['veg']) & items
others = items - fruits - veggies
This will almost surely be faster than using explicit loops. In particular doing item in foods_dict['fruit']
is time consuming if the list of fruits is long.
A very simple benchmark between the solutions so far:
In [5]: %%timeit
...: items2 = set(items)
...: fruits = set(foods_dict['fruit']) & items2
...: veggies = set(foods_dict['veg']) & items2
...: others = items2 - fruits - veggies
...:
1000000 loops, best of 3: 1.75 us per loop
In [6]: %%timeit
...: fruits = set()
...: veggies = set()
...: others = set()
...: for item in items:
...: if item in foods_dict.get('fruit'):
...: fruits.add(item)
...: elif item in foods_dict.get('veg'):
...: veggies.add(item)
...: else:
...: others.add(item)
...:
100000 loops, best of 3: 2.57 us per loop
In [7]: %%timeit
...: veggies = set(elem for elem in items if elem in foods_dict['veg'])
...: fruits = set(elem for elem in items if elem in foods_dict['fruit'])
...: others = set(items) - veggies - fruits
...:
100000 loops, best of 3: 3.34 us per loop
Surely, before choosing you should do some tests with "real inputs". I have no idea on the number of elements you have in your problem, and timings may change a lot with bigger inputs. Anyway my experience tells me that, at least in CPython, explicit loops tend to be slower than using only built-in operations.
Edit2: An example with bigger inputs:
In [9]: foods_dict = {}
...: foods_dict['fruit'] = list(range(0, 10000, 2))
...: foods_dict['veg'] = list(range(1, 10000, 2))
In [10]: items = list(range(5, 10000, 13)) #some odd some even
In [11]: %%timeit
...: fruits = set()
...: veggies = set()
...: others = set()
...: for item in items:
...: if item in foods_dict.get('fruit'):
...: fruits.add(item)
...: elif item in foods_dict.get('veg'):
...: veggies.add(item)
...: else:
...: others.add(item)
...:
10 loops, best of 3: 68.8 ms per loop
In [12]: %%timeit
...: veggies = set(elem for elem in items if elem in foods_dict['veg'])
...: fruits = set(elem for elem in items if elem in foods_dict['fruit'])
...: others = set(items) - veggies - fruits
...:
10 loops, best of 3: 99.9 ms per loop
In [13]: %%timeit
...: items2 = set(items)
...: fruits = set(foods_dict['fruit']) & items2
...: veggies = set(foods_dict['veg']) & items2
...: others = items2 - fruits - veggies
...:
1000 loops, best of 3: 445 us per loop
As you can see using only built-ins is about 20x faster then explicit loops.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With