 

How to time a transformation in Spark, given lazy execution style?

I have some code that executes quite a few steps, and I know how long the whole process takes. However, I would like to calculate how long each individual transformation takes. Here are some simple examples of the steps:

rdd1 = sc.textFile("my/filepath/*").map(lambda x: x.split(","))
rdd2 = sc.textFile("other/filepath/*").map(lambda x: x.split(","))
to_kv1 = rdd1.map(lambda x: (x[0], float(x[1])))  # x[0] is key, x[1] is numeric
to_kv2 = rdd2.map(lambda x: (x[0], float(x[1])))  # x[0] is key, x[1] is numeric
reduced1 = to_kv1.reduceByKey(lambda a, b: a + b)
reduced2 = to_kv2.reduceByKey(lambda a, b: a + b)
outer_joined = reduced1.fullOuterJoin(reduced2) # let's just assume there is key overlap
outer_joined.saveAsTextFile("my_output")

Now: how do I benchmark a specific part of this code? I know that running it end to end takes a certain amount of time (the saveAsTextFile will force it to execute), but how do I benchmark only the reduceByKey or fullOuterJoin part of the code? I know I could run count() after each operation to force execution, but that would not properly benchmark the operation, since the measured time includes the count itself as well as the transformation.
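For concreteness, the count()-based approach described above would look something like the sketch below (a minimal illustration; the flaw is that the timed region covers the count and all upstream work, not just the reduceByKey):

import time

# Naive timing: force execution with count() and measure on the driver.
# The measurement includes the count itself plus everything upstream
# (textFile and map), so it does not isolate the reduceByKey.
start = time.time()
reduced1 = to_kv1.reduceByKey(lambda a, b: a + b)
reduced1.count()  # action: triggers the whole lineage up to this point
print("reduceByKey (plus count and upstream work): %.2fs" % (time.time() - start))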

What is the best way to benchmark Spark transformations, given their lazy execution style?

Please note that I am NOT asking how to measure time. I know about the time module, start = time.time(), etc. I am asking how to benchmark given the lazy execution style of Spark transformations, which do not execute until you call an action that requires information to be returned to the driver.

asked Mar 28 '16 by Katya Willard
1 Answer

Your best bet is to use the Spark UI to read this information. The problem is twofold:

  • The computations are distributed, so even if you added a timing mechanism inside each transformation, it would be difficult to tell when a task is truly done, as it could be finished on one machine but not on another. That said, you COULD add logging inside, find the first instance of the execution, and then find the final execution. Keep in mind the next point, though.
  • The transformations are pipelined as much as possible. Spark will run multiple transformations at the same time for efficiency, so you'd have to explicitly test JUST that one operation (see the sketch after this list).
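If a rough driver-side number is still wanted despite those caveats, one common workaround (a minimal sketch, assuming the input RDD fits in cache; not something this answer prescribes as the definitive method) is to materialize the input first, so the timed region covers only the transformation under test plus a deliberately cheap action:

import time

# Sketch: isolate reduceByKey by caching and materializing its input first,
# so textFile and map are excluded from the timed region.
to_kv1.cache()
to_kv1.count()  # materialize the cache

start = time.time()
to_kv1.reduceByKey(lambda a, b: a + b).foreach(lambda _: None)  # cheap action; returns nothing to the driver
print("reduceByKey alone: %.2fs" % (time.time() - start))

Even then, the per-stage timings in the Spark UI remain the more trustworthy source, since the driver-side clock also includes job scheduling overhead.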
answered Nov 01 '22 by Justin Pihony