The code has been vastly simplified, but should serve to illustrate my question.
S = ('A1RT', 'BDF7', 'CP09')
for s in S:
    if is_valid(s):  # very slow!
        process(s)
I have a collection of strings obtained from a site-scrape. (Strings will be retrieved from site-scrapes periodically.) Each of these strings needs to be validated, over the network, against a third party. The validation process can be slow at times, which is problematic. Because the above code checks the strings one at a time, it may take some time before the last string is validated and processed.
Is there a proper way to parallelize the above logic in Python? To be frank, I'm not very familiar with concurrency / parallel-processing concepts, but it would seem as though they may be useful in this circumstance. Thoughts?
The concurrent.futures module is a great way to start work on "embarrassingly parallel" problems, and can very easily be switched between using either multiple processes or multiple threads within a single process.
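As a rough sketch of how small that switch is: the executor class is the only thing that changes. The names `slow_check` and `run_all` below are hypothetical stand-ins made up for illustration, not anything from your code.

```python
import concurrent.futures as cf
import time

def slow_check(s):
    # Hypothetical stand-in for a slow validation call.
    time.sleep(0.05)
    return s.isalnum()

def run_all(executor_class, items, workers=4):
    # executor_class is cf.ThreadPoolExecutor or cf.ProcessPoolExecutor;
    # both expose the same submit()/map() interface.
    with executor_class(max_workers=workers) as executor:
        # executor.map() returns results in the same order as `items`.
        return list(executor.map(slow_check, items))

results = run_all(cf.ThreadPoolExecutor, ["A1RT", "BDF7", "CP 09"])
```

Passing `cf.ProcessPoolExecutor` instead runs the same work in separate processes. In that case the worker function has to be defined at module level so it can be pickled, and on platforms that start workers by spawning a fresh interpreter the call should sit under an `if __name__ == "__main__":` guard.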
In your case, it sounds like the "hard work" is being done on other machines over the network, and your main program will spend most of its time waiting for them to deliver results. If so, threads should work fine. Here's a complete, executable toy example:
import concurrent.futures as cf
import random
import time

def is_valid(s):
    # simulate a slow network check: sleep up to 10 seconds,
    # then return a random verdict
    time.sleep(random.random() * 10)
    return random.choice([False, True])

NUM_WORKERS = 10  # number of threads you want to run

strings = list("abcdefghijklmnopqrstuvwxyz")

with cf.ThreadPoolExecutor(max_workers=NUM_WORKERS) as executor:
    # map a future object to the string passed to is_valid
    futures = {executor.submit(is_valid, s): s for s in strings}
    # `as_completed()` yields futures in the order threads
    # complete work, _not_ necessarily in the order the work
    # was passed out
    for future in cf.as_completed(futures):
        result = future.result()
        print(futures[future], result)
And here's sample output from one run:
g False
i True
j True
b True
f True
e True
k False
h True
c True
l False
m False
a False
s False
v True
q True
p True
d True
n False
t False
z True
o True
y False
r False
w False
u True
x False
concurrent.futures handles all the headaches of starting threads, parceling out work for them to do, and noticing when threads deliver results.
As written above, up to 10 (NUM_WORKERS) is_valid() invocations can be active simultaneously. as_completed() yields each future object as soon as its result is ready to retrieve, and the executor automatically hands the thread that computed the result another string for is_valid() to chew on.
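Applied to the loop in your question, the same pattern might look something like the sketch below. It assumes your real `is_valid` and `process` are passed in as arguments. The wrapper name `validate_and_process` and the `failures` dict are my own invention, added so that one failed network call doesn't abort the whole batch.

```python
import concurrent.futures as cf

def validate_and_process(strings, is_valid, process, num_workers=10):
    """Validate strings concurrently; process each valid one as its result arrives."""
    failures = {}  # string -> exception raised while validating it
    with cf.ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = {executor.submit(is_valid, s): s for s in strings}
        for future in cf.as_completed(futures):
            s = futures[future]
            try:
                ok = future.result()  # re-raises any exception from is_valid
            except Exception as exc:
                failures[s] = exc
                continue
            if ok:
                process(s)  # runs in the calling thread, one string at a time
    return failures
```

Here only the validation runs in the worker threads, and `process(s)` stays in the calling thread, so it doesn't need to be thread-safe. If `process` is slow too, it could be submitted to the executor in the same way.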