
Time-intensive collection processing in Python

The code has been vastly simplified, but should serve to illustrate my question.

S = ('A1RT', 'BDF7', 'CP09')
for s in S:
    if is_valid(s): # very slow!
        process(s)

I have a collection of strings obtained from a site-scrape. (Strings will be retrieved from site-scrapes periodically.) Each of these strings needs to be validated, over the network, against a third party. The validation can be slow at times, which is a problem: because the loop above handles one string at a time, it may take a long while before the last string is validated and processed.

Is there a proper way to parallelize the above logic in Python? To be frank, I'm not very familiar with concurrency / parallel-processing concepts, but it would seem as though they may be useful in this circumstance. Thoughts?

asked Mar 01 '26 05:03 by kylemart



1 Answer

The concurrent.futures module is a great way to start work on "embarrassingly parallel" problems, and can very easily be switched between using either multiple processes or multiple threads within a single process.
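As a rough illustration of how small that switch can be, here is a minimal sketch in which the executor class is just a parameter. The `validate_all` helper and the trivial `is_valid` body are hypothetical stand-ins, not part of the question's code.

```python
import concurrent.futures as cf


def is_valid(s):
    # Hypothetical stand-in for the real validator: accepts only
    # alphanumeric strings.
    return s.isalnum()


def validate_all(strings, executor_cls=cf.ThreadPoolExecutor, max_workers=4):
    """Return {string: verdict} using whichever executor class is passed in.

    `strings` should be a list or tuple, since it is iterated twice.
    """
    with executor_cls(max_workers=max_workers) as executor:
        # executor.map() yields results in the same order as its input
        return dict(zip(strings, executor.map(is_valid, strings)))


if __name__ == "__main__":
    print(validate_all(["A1RT", "BDF7", "CP-09"]))
    # Process-based variant: same call, different executor class.
    # is_valid must then live at module top level so it can be pickled.
    # print(validate_all(["A1RT"], executor_cls=cf.ProcessPoolExecutor))
```

Threads suit work that mostly waits (network, disk); processes are usually the better choice when the work is CPU-bound Python code.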

In your case, it sounds like the "hard work" is being done on other machines over the network, and your main program will spend most of its time waiting for them to deliver results. If so, threads should work fine. Here's a complete, executable toy example:

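import concurrent.futures as cf
import random
import time

def is_valid(s):
    # Toy stand-in for the slow network check: sleep for up to
    # 10 seconds, then return a random verdict.
    time.sleep(random.random() * 10)
    return random.choice([False, True])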

NUM_WORKERS = 10  # number of threads you want to run

strings = list("abcdefghijklmnopqrstuvwxyz")

with cf.ThreadPoolExecutor(max_workers=NUM_WORKERS) as executor:
    # map a future object to the string passed to is_valid
    futures = {executor.submit(is_valid, s): s for s in strings}
    # `as_completed()` yields futures in the order the threads
    # finish their work, _not_ necessarily in the order the work
    # was passed out
    for future in cf.as_completed(futures):
        result = future.result()
        print(futures[future], result)

And here's sample output from one run:

g False
i True
j True
b True
f True
e True
k False
h True
c True
l False
m False
a False
s False
v True
q True
p True
d True
n False
t False
z True
o True
y False
r False
w False
u True
x False

concurrent.futures handles all the headaches of starting threads, parceling out work for them to do, and noticing when threads deliver results.
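Applied to the loop in the question, the same pattern might look roughly like the sketch below. The `is_valid` and `process` bodies are hypothetical placeholders, and `validate_and_process` is a name invented for this illustration. One detail worth knowing: `future.result()` re-raises any exception raised inside the worker, so a failed network call can be caught per string instead of aborting the whole batch.

```python
import concurrent.futures as cf

processed = []  # records what process() has seen, for illustration only


def is_valid(s):
    # Hypothetical stand-in for the slow network validation.
    if not s:
        raise ValueError("empty string")
    return s[0] in "AB"


def process(s):
    # Hypothetical stand-in for the real processing step.
    processed.append(s)


def validate_and_process(strings, max_workers=10):
    """Validate concurrently, process valid strings, return failures."""
    failures = {}
    with cf.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(is_valid, s): s for s in strings}
        for future in cf.as_completed(futures):
            s = futures[future]
            try:
                ok = future.result()  # re-raises exceptions from is_valid
            except Exception as exc:
                failures[s] = exc
                continue
            if ok:
                process(s)  # runs in the main thread, one string at a time
    return failures


if __name__ == "__main__":
    S = ("A1RT", "BDF7", "CP09")
    print(validate_and_process(S), sorted(processed))
```

Here only the validation runs in worker threads; `process()` stays in the main thread, which avoids having to make it thread-safe.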

As written above, up to 10 (NUM_WORKERS) is_valid() invocations can be active simultaneously. as_completed() yields each future object as soon as its result is ready to retrieve, and the executor automatically hands the thread that computed the result another string for is_valid() to chew on.
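One way to see that cap in action is to count how many calls are in flight at once. This is only a sketch: the counters, the lock, and the `peak_concurrency` helper are additions for the demonstration, and `is_valid` is again a toy stand-in that just sleeps briefly.

```python
import concurrent.futures as cf
import threading
import time

NUM_WORKERS = 3  # kept small on purpose so the cap is easy to observe

_lock = threading.Lock()
_active = 0  # is_valid() calls currently running
_peak = 0    # highest value _active has reached


def is_valid(s):
    # Toy stand-in for the network check that also tracks concurrency.
    global _active, _peak
    with _lock:
        _active += 1
        _peak = max(_peak, _active)
    time.sleep(0.05)  # pretend to wait on the network
    with _lock:
        _active -= 1
    return True


def peak_concurrency(strings, num_workers=NUM_WORKERS):
    """Run is_valid over strings and report the peak number of active calls."""
    global _active, _peak
    _active = _peak = 0
    with cf.ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(is_valid, s) for s in strings]
        for future in cf.as_completed(futures):
            future.result()
    return _peak


if __name__ == "__main__":
    print("peak active calls:", peak_concurrency(list("abcdefghij")))
```

The reported peak never exceeds `num_workers`, however many strings are submitted; the rest simply wait in the executor's queue.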

answered Mar 03 '26 13:03 by Tim Peters



