After using get to extract information from an API in JSON format, I'm now attempting to calculate an average of the price values in an efficient way.
data (example response from the API call):
...
{u'status': u'success', u'data': {u'context_id': u'2', u'app_id': u'123', u'sales': [{u'sold_at': 133, u'price': u'1.8500', u'hash_name': u'Xuan881', u'value': u'-1.00000'}, {u'sold_at': 139, u'price': u'2.6100', u'hash_name': u'Xuan881', u'value': u'-1.00000'},
... etc.
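For context, the response above was fetched with something along these lines (just a sketch assuming the requests library; the URL and parameters are placeholders, not the real endpoint):

import requests

# Placeholder endpoint and parameters, for illustration only
response = requests.get("https://example.com/api/sales", params={"app_id": "123"})
data = response.json()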
I have managed to do so with the following code:
len_sales = len(data["data"]["sales"])
total_p = 0
for i in range(0, len_sales):
    total_p += float(data["data"]["sales"][i]["price"])
average = total_p / len_sales
print average
However, since the data dictionary retrieved is large, there seems to be quite a bit of waiting time before the output is shown.
Therefore, I was wondering whether there is a more efficient and/or pythonic way of achieving the same result, but in a shorter time.
First, you're not looping through a dict, you're looping through a list that happens to be inside a dict.
Second, doing something for every value in a list inherently requires visiting every value in the list; there's no way around the linear cost.
So, the only thing available is micro-optimizations, which probably won't make much difference—if your code is way too slow, 10% faster doesn't help, and if your code is already fast enough, you don't need it—but occasionally they are needed.
And in this case, almost all of the micro-optimizations also make your code more readable and Pythonic, so there's no good reason not to do them:
First, you're spelling out data["data"]["sales"] twice, and the lookup inside the loop is repeated on every iteration. The performance cost of that is probably negligible, but it also makes your code less readable, so let's fix that:
sales = data["data"]["sales"]
Next, instead of looping for i in range(0, len_sales): just to use sales[i], it's faster (and, again, more readable) to just loop over sales directly:
for sale in sales:
    total_p += float(sale["price"])
And now we can turn this loop into a generator expression, which is slightly more efficient (although that's partly canceled by the overhead of the generator itself, so you might actually want to test this one):
prices = (float(sale["price"]) for sale in sales)
… and pass that directly to sum:
total_p = sum(float(sale["price"]) for sale in sales)
We can also use the mean function that comes with Python instead of doing it manually:
average = statistics.mean(float(sale["price"]) for sale in sales)
… except that you're apparently using Python 2, so you'd need to install the unofficial backport off PyPI (the official stats backport only goes back to 3.1; the 2.x version was abandoned), so let's skip that part.
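For reference, on Python 3.4+ (where statistics ships with the standard library) it would look something like this:

import statistics

sales = data["data"]["sales"]
average = statistics.mean(float(sale["price"]) for sale in sales)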
Putting it all together:
sales = data["data"]["sales"]
total = sum(float(sale["price"]) for sale in sales)
average = total / len(sales)
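With just the two sample sales shown above, for example, that works out to (1.85 + 2.61) / 2 = 2.23.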
A couple of things that might help (if it matters, you definitely are going to want to test with timeit):
You can use operator.itemgetter to get the price item. That means your expression is now just two chained function calls, which means you can chain two map calls:
import operator
total = sum(map(float, map(operator.itemgetter("price"), sales)))
That's probably less readable than the comprehension to anyone who isn't coming from a Lisp background, but it's certainly not terrible, and it might be a little faster.
Alternatively, for moderately-sized input, building a temporary list is sometimes worth it. Sure, you waste time allocating memory and copying data around, but iterating a list is faster than iterating a generator, so the only way to really be sure is to test.
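If you do want to try the list version, it's just the same expression in square brackets:

total = sum([float(sale["price"]) for sale in sales])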
One more thing that might make a difference is to move this whole thing into a function. Code at the top level doesn't have local variables, only global, and they're slower to look up.
If you really need to squeeze out the last few percentage points, it's sometimes even worth copying global and builtin functions like float into locals. Of course that isn't going to help with map (since we're only accessing them once), but with a comprehension it might, so I'll show how to do it anyway:
def total_price(sales):
    _float = float
    pricegetter = operator.itemgetter("price")
    return sum(map(_float, map(pricegetter, sales)))
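Then getting the average is just a matter of dividing by the length:

average = total_price(sales) / len(sales)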
The best way to benchmark code is to use the timeit module (or, if you're using IPython, the %timeit magic), which works like this:
In [3]: %%timeit
... total_p = 0
... for i in range(0,len_sales):
...     total_p += float(data["data"]["sales"][i]["price"])
10000 loops, best of 3: 28.4 µs per loop
In [4]: %timeit sum(float(sale["price"]) for sale in sales)
10000 loops, best of 3: 18.4 µs per loop
In [5]: %timeit sum(map(float, map(operator.itemgetter("price"), sales)))
100000 loops, best of 3: 16.9 µs per loop
In [6]: %timeit sum([float(sale["price"]) for sale in sales])
100000 loops, best of 3: 18.2 µs per loop
In [7]: %timeit total_price(sales)
100000 loops, best of 3: 17.2 µs per loop
So, on my laptop, with your sample data:
- Looking up sales just once and using a generator expression instead of an explicit for statement is about 35% faster.
- Using map and itemgetter instead of a genexpr is about another 10% faster.
- Moving the map version into a function with local copies doesn't really help (the functions are only looked up once each anyway thanks to map, so we're just adding a tiny overhead for probably 0 benefit).
Overall, sum(map(…, map(…))) turned out to be the fastest for this particular input, on my laptop.
But of course you'll want to repeat this test on your real environment with your real input. When differences as small as 10% matter, you can't just assume that the details will transfer.
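If you're not using IPython, the timeit module can do the same job from a plain script. As a rough sketch (the variable names here are just for illustration):

import operator
import timeit

sales = data["data"]["sales"]

# Time the map/itemgetter version; number controls how many times it runs
elapsed = timeit.timeit(
    lambda: sum(map(float, map(operator.itemgetter("price"), sales))),
    number=100000)
print elapsed / 100000  # average seconds per call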
One more thing: If you really need to speed things up, often the simplest thing to do is take the exact same code and run it in PyPy instead of the usual CPython interpreter. Repeating some of the above tests:
In [4]: %timeit sum(float(sale["price"]) for sale in sales)
680 ns ± 19.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit sum(map(float, map(operator.itemgetter("price"), sales)))
800 ns ± 24.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [6]: %timeit sum([float(sale["price"]) for sale in sales])
694 ns ± 24.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Now the generator expression version is the fastest—but, more importantly, all three versions are roughly 20x as fast as they were in CPython. A 2000% improvement is a lot better than a 35% improvement.