I have a system written in Python that processes large amounts of data using plug-ins written by several developers with varying levels of experience.
Basically, the application starts several worker threads, then feeds them data. Each thread determines which plug-in to use for an item and asks it to process the item. A plug-in is just a Python module with a specific function defined. The processing usually involves regular expressions and should not take more than a second or so.
Occasionally, one of the plug-ins will take minutes to complete, pegging the CPU at 100% the whole time. This is usually caused by a sub-optimal regular expression paired with a data item that exposes that inefficiency.
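To illustrate the kind of regex/data pairing I mean, here is a minimal sketch. The pattern and inputs are made up for the example (they are not from any real plug-in): nested quantifiers run fine on matching input but backtrack exponentially when the match fails near the end.

```python
import re
import time

# Hypothetical pattern with nested quantifiers; not taken from a real plug-in.
PATTERN = re.compile(r"(a+)+$")

def cpu_seconds(text):
    """Return the CPU time spent trying to match PATTERN against text."""
    start = time.process_time()
    PATTERN.match(text)
    return time.process_time() - start

# Matching input: the regex succeeds almost immediately.
fast = cpu_seconds("a" * 20)
# Non-matching input: the trailing "b" forces the engine to try every way
# of splitting the run of "a"s before it can give up.
slow = cpu_seconds("a" * 20 + "b")

print(f"matching input:     {fast:.6f} s")
print(f"non-matching input: {slow:.6f} s")
```

Each extra "a" in the non-matching input roughly doubles the time, which is how a single item out of thousands can go from milliseconds to minutes.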
This is where things get tricky. If I have a suspicion of who the culprit is, I can examine its code and find the problem. However, sometimes I'm not so lucky.
Short of rewriting the whole architecture to use multiprocessing, is there any way I can find out who is eating all my CPU?
ADDED: In answer to some of the comments:
Profiling multithreaded code in Python is not useful here because the profiler measures the total elapsed time of each function, not the active CPU time. Try cProfile.run('time.sleep(3)') to see what I mean. (credit to rog [last comment]).
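For comparison, here is a rough sketch of measuring both numbers around a single plug-in call. It assumes Python 3.7 or later, where time.thread_time() reports the CPU time of the calling thread only; timed_plugin_call and the sleeping stand-in plug-in are names I made up for the example.

```python
import time

def timed_plugin_call(plugin_func, item):
    """Call plugin_func(item) and return (result, wall_seconds, cpu_seconds).

    cpu_seconds comes from time.thread_time(), which counts only the CPU
    time used by the calling thread (assumes Python 3.7+).
    """
    wall_start = time.perf_counter()
    cpu_start = time.thread_time()
    result = plugin_func(item)
    cpu = time.thread_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu

def sleepy_plugin(item):
    """Hypothetical plug-in that only waits, like a network-bound call."""
    time.sleep(0.2)
    return item

result, wall, cpu = timed_plugin_call(sleepy_plugin, "example item")
print(f"wall: {wall:.3f} s, cpu: {cpu:.3f} s")
```

A sleeping or network-bound call shows a large wall time and almost no CPU time, whereas a runaway regex would show both numbers growing together.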
Going single-threaded is tricky because only 1 item in 20,000 causes the problem, and I don't know which one it is. Running multithreaded allows me to go through 20,000 items in about an hour, while running single-threaded can take much longer (there's a lot of network latency involved). There are some more complications that I'd rather not get into right now.
That said, it's not a bad idea to try to serialize the specific code that calls the plugins, so that timing of one will not affect the timing of the others. I'll try that and report back.
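Something along these lines is what I have in mind. It is only a sketch: call_plugin_serialized, the lock, and the one-second threshold are placeholders I picked for illustration, not existing code.

```python
import threading
import time

# One lock shared by all worker threads, so only one plug-in runs at a time.
_plugin_lock = threading.Lock()
# (plugin_name, item, elapsed_seconds) for every call over the threshold.
slow_calls = []

def call_plugin_serialized(plugin_name, plugin_func, item, threshold=1.0):
    """Run plugin_func(item) under the lock and record it if it was slow."""
    with _plugin_lock:
        start = time.perf_counter()
        result = plugin_func(item)
        elapsed = time.perf_counter() - start
    if elapsed > threshold:
        slow_calls.append((plugin_name, item, elapsed))
    return result
```

The network-bound parts of each worker stay concurrent; only the plug-in call itself is serialized, so the elapsed time of one call is not inflated by other plug-ins competing for the GIL.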
You apparently don't need multithreading, only concurrency, because your threads don't share any state:
Try multiprocessing instead of multithreading.
Single thread / N subprocesses. There you can time each request, since no GIL is held.
Another possibility is to get rid of multiple execution threads and use event-based network programming (i.e. use Twisted).
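A minimal sketch of the multiprocessing suggestion, assuming the per-item work can be wrapped in a module-level function. process_item here is a hypothetical stand-in for "pick the plug-in and run it", and the threshold is arbitrary.

```python
import time
from multiprocessing import Pool

def process_item(item):
    """Hypothetical stand-in for choosing a plug-in and running it on item."""
    return sum(ord(c) for c in item)

def timed_process(item):
    """Process one item and report the CPU time this worker spent on it."""
    start = time.process_time()
    result = process_item(item)
    cpu = time.process_time() - start
    return item, result, cpu

def run_all(items, workers=4, threshold=1.0):
    """Run items through a process pool; return (item, cpu) for slow ones."""
    slow = []
    with Pool(processes=workers) as pool:
        for item, _result, cpu in pool.imap_unordered(timed_process, items):
            if cpu > threshold:
                slow.append((item, cpu))
    return slow
```

Each worker process handles one item at a time, so the process CPU time measured around the call belongs to that item alone. Call run_all(items) from under an if __name__ == "__main__": guard so that it also works on platforms that start workers by re-importing the main module.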