Suppose you have a list comprehension in Python, like
values = [f(x) for x in range(0, 1000)]
with f being just a function without side effects. So all the entries can be computed independently.
Is Python able to increase the performance of this list comprehension compared with the "obvious" implementation; e.g. by shared-memory-parallelization on multicore CPUs?
In Python 3.2 they added concurrent.futures, a nice library for solving problems concurrently. Consider this example:
import math, time
from concurrent import futures

PRIMES = [112272535095293, 112582705942171, 112272535095293, 115280095190773, 115797848077099, 1099726899285419, 112272535095293, 112582705942171, 112272535095293, 115280095190773, 115797848077099, 1099726899285419]

def is_prime(n):
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2  # 2 is the only even prime
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

def bench(f):
    start = time.time()
    f()
    elapsed = time.time() - start
    print("Completed in {} seconds".format(elapsed))

def concurrent():
    with futures.ProcessPoolExecutor() as executor:
        values = list(executor.map(is_prime, PRIMES))

def listcomp():
    values = [is_prime(x) for x in PRIMES]
Results on my quad core:
>>> bench(listcomp)
Completed in 14.463825941085815 seconds
>>> bench(concurrent)
Completed in 3.818351984024048 seconds
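One knob worth knowing: when the function is cheap relative to the cost of shipping arguments between processes, executor.map accepts a chunksize argument (available since Python 3.5) that batches several items into each inter-process message. A minimal sketch, using a hypothetical cheap square function rather than is_prime:

```python
from concurrent import futures

def square(x):
    # stand-in for any cheap, side-effect-free function
    return x * x

if __name__ == "__main__":
    with futures.ProcessPoolExecutor() as executor:
        # chunksize batches work items, cutting per-item IPC overhead;
        # results still come back in input order
        values = list(executor.map(square, range(1000), chunksize=100))
    print(values == [x * x for x in range(1000)])  # True
```

The right chunksize depends on how expensive each call is; for a heavyweight function like is_prime the default of 1 is fine.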
No, Python will not magically parallelize this for you. In fact, it can't: it cannot prove the independence of the entries, and that would require a great deal of program inspection/verification, which is impossible to get right in the general case. (In CPython, even thread-based shared-memory parallelism would not speed up CPU-bound work, because the global interpreter lock allows only one thread to execute Python bytecode at a time.)
If you want quick coarse-grained multicore parallelism, I recommend joblib instead:
from joblib import delayed, Parallel
values = Parallel(n_jobs=NUM_CPUS)(delayed(f)(x) for x in range(1000))  # n_jobs=-1 means "use all cores"
Not only have I witnessed near-linear speedups using this library, it also has the great feature of forwarding signals such as Ctrl-C to its worker processes, which cannot be said of all multiprocessing libraries.
Note that joblib doesn't really support shared-memory parallelism: it spawns worker processes, not threads, so it incurs some communication overhead from sending data to workers and results back to the master process.