
Is there a more pythonic/more efficient way to loop through dictionary containing lists rather than using for loops?

After using get to extract information from an API in JSON format, I'm now attempting to calculate an average of price in an efficient way.

data (Example response from API Call):

...
{u'status': u'success',
 u'data': {u'context_id': u'2',
           u'app_id': u'123',
           u'sales': [{u'sold_at': 133, u'price': u'1.8500', u'hash_name': u'Xuan881', u'value': u'-1.00000'},
                      {u'sold_at': 139, u'price': u'2.6100', u'hash_name': u'Xuan881', u'value': u'-1.00000'},
... etc.

I have managed to do so with the following code:

len_sales = len(data["data"]["sales"])
total_p = 0 
for i in range(0,len_sales):
    total_p += float(data["data"]["sales"][i]["price"])
average = total_p/len_sales
print average

However, since the data dictionary retrieved is large in size, there seems to be quite a bit of waiting time before the output is shown.

Therefore, I was wondering whether there is a more efficient and/or pythonic way of achieving the same result, but in a shorter time.

asked Jan 27 '23 by ThatOneNoob


1 Answer

First, you're not looping through a dict, you're looping through a list that happens to be inside a dict.

Second, doing something for every value in a list inherently requires visiting every value in the list; there's no way around the linear cost.

So, the only thing left is micro-optimization, which usually won't make much difference: if your code is way too slow, a 10% speedup won't rescue it, and if it's already fast enough, you don't need it. But occasionally it is needed.

And in this case, almost all of the micro-optimizations also make your code more readable and Pythonic, so there's no good reason not to do them:


First, you're accessing data["data"]["sales"] twice. The performance cost of that is probably negligible, but it also makes your code less readable, so let's fix that:

sales = data["data"]["sales"]

Next, instead of looping for i in range(0, len_sales): just to use sales[i], it's faster—and, again, more readable—to just loop over sales:

for sale in sales:
    total_p += float(sale["price"])

And now we can turn this loop into a comprehension, which is slightly more efficient (although that's partly canceled by the cost of adding a generator—you might actually want to test this one):

prices = (float(sale["price"]) for sale in sales)

… and pass that directly to sum:

total_p = sum(float(sale["price"]) for sale in sales)

We can also use the mean function that comes with Python instead of doing it manually:

import statistics

average = statistics.mean(float(sale["price"]) for sale in sales)

… except that you're apparently using Python 2 (note the print statement), and statistics was only added to the standard library in Python 3.4, so you'd have to track down an unofficial backport on PyPI. Let's skip that part.

Putting it all together:

sales = data["data"]["sales"]
total = sum(float(sale["price"]) for sale in sales)
average = total / len(sales)
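One caveat the snippet above glosses over: if the API can ever return an empty sales list, the division raises ZeroDivisionError. A guard is cheap (the fallback value here is just a placeholder; pick whatever makes sense for your app):

sales = data["data"]["sales"]
if sales:  # guard against an empty response
    average = sum(float(sale["price"]) for sale in sales) / len(sales)
else:
    average = None  # or 0.0, or raise; whatever fits your application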

A couple of things might help further; if it matters, you'll definitely want to test with timeit:

You can use operator.itemgetter to get the price item. Which means your expression is now just chaining two function calls, which means you can chain two map calls:

import operator

total = sum(map(float, map(operator.itemgetter("price"), sales)))

That's probably less readable than the comprehension to anyone who isn't coming from a Lisp background, but it's certainly not terrible, and it might be a little faster.


Alternatively, for moderately-sized input, building a temporary list is sometimes worth it. You waste time allocating memory and copying data around, but iterating a list is faster than iterating a generator, and the only way to know which effect wins is to test.
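For reference, the temporary-list version is just the same comprehension in square brackets (it appears in the benchmarks below):

# builds the full list of prices in memory, then sums it
total = sum([float(sale["price"]) for sale in sales])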


One more thing that might make a difference is to move this whole thing into a function. Code at the top level doesn't have local variables, only global, and they're slower to look up.
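As a minimal sketch of that (the name average_price is mine, not from the question):

def average_price(sales):
    # inside a function, names like `sales` are fast local lookups
    return sum(float(sale["price"]) for sale in sales) / len(sales)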

If you really need to squeeze out the last few percentage points, it's sometimes even worth copying global and builtin functions like float into locals. Of course that isn't going to help with map (since we're only accessing them once), but with a comprehension it might, so I'll show how to do it anyway:

def total_price(sales):
    # cache the builtin and the itemgetter in locals for faster lookup
    _float = float
    pricegetter = operator.itemgetter("price")
    return sum(map(_float, map(pricegetter, sales)))
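With that helper, the average is just total_price(sales) / len(sales).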

The best way to benchmark code is to use the timeit module (or, if you're using IPython, the %timeit magic), which works like this:

In [3]: %%timeit
... total_p = 0 
... for i in range(0,len_sales):
...     total_p += float(data["data"]["sales"][i]["price"])
10000 loops, best of 3: 28.4 µs per loop
In [4]: %timeit sum(float(sale["price"]) for sale in sales)
10000 loops, best of 3: 18.4 µs per loop
In [5]: %timeit sum(map(float, map(operator.itemgetter("price"), sales)))
100000 loops, best of 3: 16.9 µs per loop
In [6]: %timeit sum([float(sale["price"]) for sale in sales])
100000 loops, best of 3: 18.2 µs per loop
In [7]: %timeit total_price(sales)
100000 loops, best of 3: 17.2 µs per loop

So, on my laptop, with your sample data:

  • Looping directly over sales and using a generator expression instead of a statement is about 35% faster.
  • Using a list comprehension instead of a genexpr is about 1% faster than that.
  • Using map and itemgetter instead of a genexpr is about 10% faster.
  • Wrapping it in a function and caching the locals made things slightly slower. (Not surprising—as mentioned above, we only had a single lookup for each name anyway, thanks to map, so we're just adding a tiny overhead for probably 0 benefit.)

Overall, sum(map(…map(…))) turned out to be the fastest for this particular input, on my laptop.

But of course you'll want to repeat this test on your real environment with your real input. When differences as small as 10% matter, you can't just assume that the details will transfer.
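Outside IPython, a plain-stdlib version of the same measurement might look something like this (a minimal sketch; it assumes sales is already built, and the repeat count is arbitrary):

import timeit

# time the genexpr version; the lambda closes over `sales`
t = timeit.timeit(lambda: sum(float(sale["price"]) for sale in sales),
                  number=100000)
print(t / 100000)  # average seconds per call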


One more thing: If you really need to speed things up, often the simplest thing to do is take the exact same code and run it in PyPy instead of the usual CPython interpreter. Repeating some of the above tests:

In [4]: %timeit sum(float(sale["price"]) for sale in sales)
680 ns ± 19.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit sum(map(float, map(operator.itemgetter("price"), sales)))
800 ns ± 24.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [6]: %timeit sum([float(sale["price"]) for sale in sales])
694 ns ± 24.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Now the generator expression version is the fastest—but, more importantly, all three versions are roughly 20x as fast as they were in CPython. A 2000% improvement is a lot better than a 35% improvement.

answered Jan 30 '23 by abarnert