I have some code that executes quite a few steps, and I know how long the whole process takes. However, I would like to be able to calculate how long each individual transformation takes. Here are some simple examples of the steps:
rdd1 = sc.textFile("my/filepath/*").map(lambda x: x.split(","))
rdd2 = sc.textFile("other/filepath/*").map(lambda x: x.split(","))
to_kv1 = rdd1.map(lambda x: (x[0], float(x[1]))) # x[0] is key, x[1] is numeric (cast from string)
to_kv2 = rdd2.map(lambda x: (x[0], float(x[1]))) # x[0] is key, x[1] is numeric (cast from string)
reduced1 = to_kv1.reduceByKey(lambda a, b: a+b)
reduced2 = to_kv2.reduceByKey(lambda a, b: a+b)
outer_joined = reduced1.fullOuterJoin(reduced2) # let's just assume there is key overlap
outer_joined.saveAsTextFile("my_output")
Now: how do I benchmark a specific part of this code? I know that running it end to end takes a certain amount of time (the saveAsTextFile will force it to execute), but how do I benchmark only the reduceByKey or fullOuterJoin part of the code? I know I could run count() after each operation to force execution, but that would not properly benchmark the operation, since it adds the time it takes to perform the count as well as the time to perform the transformation.
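For what it's worth, here is a rough sketch of that count()-based approach, isolating the reduceByKey step by caching and materializing its input first (the measured interval still includes the count overhead I mentioned):

import time

# materialize and cache the input once, so reading and splitting the files
# is excluded from the next measurement
to_kv1.cache()
to_kv1.count()

start = time.time()
to_kv1.reduceByKey(lambda a, b: a + b).count() # the count() action forces the reduceByKey to run
print("reduceByKey + count overhead: %.3f s" % (time.time() - start))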
What is the best way to benchmark Spark transformations, given their lazy execution style?
Please note that I am NOT asking how to measure time. I know about the time module, start = time.time(), etc. I am asking how to benchmark given the lazy execution style of Spark transformations, which do not execute until you call an action that requires information to be returned to the driver.
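To be concrete about the laziness: timing the transformation line itself measures essentially nothing, for example:

import time

start = time.time()
joined = reduced1.fullOuterJoin(reduced2) # returns almost instantly; no work has happened yet
print("transformation call: %.6f s" % (time.time() - start))

start = time.time()
joined.count() # only now does Spark execute the join (and everything upstream of it)
print("action: %.3f s" % (time.time() - start))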
Your best bet is to use the Spark UI to read this information. The problem is two-fold:
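One thing that helps when reading the UI, in any case, is tagging the jobs before you trigger each action so the timings are easy to attribute; a rough sketch using setJobGroup (the group ids and description strings below are arbitrary labels of my own):

sc.setJobGroup("bench-reduce", "force reduceByKey")
reduced1.count()

sc.setJobGroup("bench-join", "force fullOuterJoin")
outer_joined.count()

# the Jobs and Stages tabs of the Spark UI then show per-stage and per-task durations under those labels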