I have a fairly complex computational code that I'm trying to speed up and multi-thread. In order to optimize the code, I'm trying to work out which functions are taking the longest or being called the most.
I haven't really profiled code before, so I could be missing something. However, I know many existing profiling modules don't really play nice with numba's njit() decorator due to the recompiling with LLVM.
So my question would be this: What's the best way to profile code in which most functions have the njit() decorator, with a few non-jitted control functions?
I've come across data_profiler before, however it doesn't seem to be in the conda repository anymore and I wouldn't know how to build it from source in conda, or if it would still be compatible with modern versions of its dependencies.
Having spent a few tens of man-years on QuantFX module development, using numba as well as other vectorisation / JIT-acceleration tools, let me share a few pieces of experience that proved handy for our similarly motivated profiling.
Unlike the mentioned data_profiler, which works with millisecond resolution, we enjoyed microsecond resolution, which came as a side-effect of using ZeroMQ for our distributed signalling / messaging infrastructure. ZeroMQ implements all of its services in a core engine called a Context, yet it also ships one small utility that is free to re-use independently of that instrumentation: Stopwatch - a microsecond-resolution timer class.
So, nothing could stop us from:
from zmq import Stopwatch as MyClock    # pyzmq installs the module as "zmq", not "pyzmq"

aClock_A = MyClock(); aClock_B = MyClock(); aClock_C = MyClock(); print( "ACK: A,B,C made" )

# may use 'em when "framing" a code-execution block:
aClock_A.start()
_ = sum( aNumOfCollatzConjectureSteps( N ) for N in range( 10**10 ) )
TASK_A_us = aClock_A.stop()             # .stop() returns the elapsed time in [us]
print( "INF: Collatz-task took {0:} [us]".format( TASK_A_us ) )

# may add 'em into call-signatures and pass 'em and/or re-use 'em inside whatever our code
aReturnedVALUE = aNumbaPreCompiledCODE( 1234,
                                        "myCode with a need to profile on several levels",
                                        aClock_A, # several,
                                        aClock_B, # pre-instantiated,
                                        aClock_C  # Stopwatch instances, so as
                                        )         # to avoid chained latencies
This way one can, if indeed pushed into using at least this as a tool of the last resort, "hard-wire" any structure of Stopwatch-based profiling into one's own source code. The only restriction is the need to conform to the finite-state automaton of the Stopwatch instance: once the .start() method has been called, only a .stop() method may come next, and similarly, calling the .stop() method on a not-yet-.start()-ed instance will quite naturally throw an exception.
The common try-except-finally scaffolding will help to ascertain that all Stopwatch instances become .stop()-ed again, even if exceptions have happened.
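One caveat: zmq.Stopwatch has been deprecated and then dropped from recent pyzmq releases, so a tiny stand-in may be needed. Below is a minimal, hypothetical sketch of such a substitute, built on the standard library's time.perf_counter_ns(), keeping the same .start()/.stop() finite-state behaviour, followed by the try / finally scaffolding described above (the substitute raises RuntimeError rather than pyzmq's own exception type; the workload is a stand-in):

```python
import time

class Stopwatch:
    """Minimal stand-in for zmq.Stopwatch: same .start()/.stop()
       finite-state automaton, microseconds via time.perf_counter_ns()."""
    def __init__( self ):
        self._t0 = None                       # None <=> not started

    def start( self ):
        self._t0 = time.perf_counter_ns()

    def stop( self ):
        if self._t0 is None:                  # FSM violation, as described above
            raise RuntimeError( "Stopwatch.stop() called on a not yet .start()-ed instance" )
        elapsed_us = ( time.perf_counter_ns() - self._t0 ) // 1000
        self._t0 = None                       # re-arm for the next .start()
        return elapsed_us

# the try-finally scaffolding, ascertaining the clock gets .stop()-ed
# even if the framed block raises:
aClock = Stopwatch()
aClock.start()
try:
    result = sum( n * n for n in range( 10**5 ) )   # a stand-in workload
finally:
    elapsed_us = aClock.stop()
print( "INF: workload took {0:} [us]".format( elapsed_us ) )
```

The same instances can then be passed around and re-started / re-stopped at whatever nesting levels the profiling structure requires.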
The structure of the "hard-wired" profiling depends on your code-execution "Hot-Spots under test". It can even cover "cross-boundary" profiling of call-related overheads, i.e. the time spent between a native Python call into the @jit-decorated, numba-LLVM-compiled code and the start of the first line "inside" the numba-compiled code (for example, how long the call-invocation and parameter analysis takes when driven by a list of call-signatures, versus when that analysis is avoided in principle by enforcing a single, explicit call-signature).
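To make the last point concrete, here is a hedged sketch of measuring that call-related overhead: a lazily compiled @njit function (whose first call pays the LLVM compilation), a warm second call, and an eagerly compiled variant with a single explicit signature. The collatz_steps_* functions and time_call_us helper are illustrative names, not part of any library; if numba is not installed, a transparent no-op decorator stands in so the sketch still runs (then all three timings are plain-Python calls):

```python
import time

try:
    from numba import njit                    # real numba, if installed
except ImportError:                           # otherwise a transparent no-op stand-in
    def njit( *args, **kwargs ):
        if args and callable( args[0] ):      # used as bare @njit
            return args[0]
        return lambda f: f                    # used as @njit( "sig" )

@njit                                         # lazy: compiles on the first call
def collatz_steps_lazy( n ):
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

@njit( "int64(int64)" )                       # eager: one explicit call-signature,
def collatz_steps_eager( n ):                 #        compiled at decoration time
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

def time_call_us( fn, arg ):                  # frame a single call, report [us]
    t0 = time.perf_counter_ns()
    fn( arg )
    return ( time.perf_counter_ns() - t0 ) // 1000

first_us = time_call_us( collatz_steps_lazy,  27 )   # includes LLVM compilation
warm_us  = time_call_us( collatz_steps_lazy,  27 )   # call + dispatch overhead only
eager_us = time_call_us( collatz_steps_eager, 27 )   # no signature analysis per call
print( "INF: 1st {0:} [us] | warm {1:} [us] | eager {2:} [us]".format( first_us, warm_us, eager_us ) )
```

Comparing the warm and eager timings against a pure-Python baseline isolates the per-call dispatch cost the paragraph above talks about, separately from the one-off compilation cost visible in the first call.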
Good luck. Hope this helps.