Python

Question

I have a dataset like this (simplified):

foods_dict = {}
foods_dict['fruit'] = ['apple', 'orange', 'plum']
foods_dict['veg'] = ['cabbage', 'potato', 'carrot']

And I have a list of items I want to categorize:

items = ['orange', 'potato', 'cabbage', 'plum', 'farmer', 'egg']

I want to be able to assign item from the items list into smaller lists based on their occurrence in the foods_dict. I think these sublists should actually be sets as I don't want any duplicates in there.

My first pass at the code was like this:

fruits = set()
veggies = set()
others = set()
for item in items:
    if item in foods_dict.get('fruit'):
        fruits.add(item)
    elif item in foods_dict.get('veg'):
        veggies.add(item)
    else:
        others.add(item)

But this seems really inefficient and needlessly verbose to me. My question is, how could this code be improved? I'm guessing list comprehension could be useful here, but I'm not sure with the number of lists.

Bakuriu · Accepted Answer

For an efficient solution you want to avoid explicit loops as much as possible:

items = set(items)
fruits = set(foods_dict['fruit']) & items
veggies = set(foods_dict['veg']) & items
others = items - fruits - veggies

This will almost surely be faster than using explicit loops. In particular doing item in foods_dict['fruit'] is time consuming if the list of fruits is long.

A very simple benchmark between the solutions so far:

In [5]: %%timeit
   ...: items2 = set(items)
   ...: fruits = set(foods_dict['fruit']) & items2
   ...: veggies = set(foods_dict['veg']) & items2
   ...: others = items2 - fruits - veggies
   ...: 
1000000 loops, best of 3: 1.75 us per loop

In [6]: %%timeit
   ...: fruits = set()
   ...: veggies = set()
   ...: others = set()
   ...: for item in items:
   ...:     if item in foods_dict.get('fruit'):
   ...:         fruits.add(item)
   ...:     elif item in foods_dict.get('veg'):
   ...:         veggies.add(item)
   ...:     else:
   ...:         others.add(item)
   ...: 
100000 loops, best of 3: 2.57 us per loop

In [7]: %%timeit
   ...: veggies = set(elem for elem in items if elem in foods_dict['veg'])
   ...: fruits = set(elem for elem in items if elem in foods_dict['fruit'])
   ...: others = set(items) - veggies - fruits
   ...: 
100000 loops, best of 3: 3.34 us per loop

Surely, before choosing you should do some tests with "real inputs". I have no idea on the number of elements you have in your problem, and timings may change a lot with bigger inputs. Anyway my experience tells me that, at least in CPython, explicit loops tend to be slower than using only built-in operations.

Edit2: An example with bigger inputs:

In [9]: foods_dict = {}
   ...: foods_dict['fruit'] = list(range(0, 10000, 2))
   ...: foods_dict['veg'] = list(range(1, 10000, 2))

In [10]: items = list(range(5, 10000, 13))  #some odd some even

In [11]: %%timeit
    ...: fruits = set()
    ...: veggies = set()
    ...: others = set()
    ...: for item in items:
    ...:     if item in foods_dict.get('fruit'):
    ...:         fruits.add(item)
    ...:     elif item in foods_dict.get('veg'):
    ...:         veggies.add(item)
    ...:     else:
    ...:         others.add(item)
    ...: 
10 loops, best of 3: 68.8 ms per loop

In [12]: %%timeit
    ...: veggies = set(elem for elem in items if elem in foods_dict['veg'])
    ...: fruits = set(elem for elem in items if elem in foods_dict['fruit'])
    ...: others = set(items) - veggies - fruits
    ...: 
10 loops, best of 3: 99.9 ms per loop

In [13]: %%timeit
    ...: items2 = set(items)
    ...: fruits = set(foods_dict['fruit']) & items2
    ...: veggies = set(foods_dict['veg']) & items2
    ...: others = items2 - fruits - veggies
    ...: 
1000 loops, best of 3: 445 us per loop

As you can see using only built-ins is about 20x faster then explicit loops.

Python - categorizing items in a list based on occurrence in a dictionary of lists

Tags:

dictionary

list

Darwin Tech

1 Answers

Bakuriu

Recent Activity

Donate For Us