
How to debug the function passed to mapPartitions

How can I deal with the problem that I can't debug my code with print statements inside the function I pass to mapPartitions() in PySpark?

Consider the example:

def func(kv_iterator):
    for key, value in kv_iterator:
        #do fancy stuff
        print('This print statement does not reach the driver program')
    return [result]

result = someRdd.mapPartitions(func)

Inside func I'd like to do lots of work with iterables and indexing, but I can't test my code without having access to the variables inside func.

Is it possible to somehow redirect the print statements from, let's say, one partition to my driver program / output channel?

JohnnyS asked Nov 30 '17 21:11

1 Answer

You can use one of the following:

  • Use local mode. All output should be visible in the console. If it is not, your code is probably never executed; trigger evaluation with an action such as result.count() or result.foreach(lambda _: None). That is the most likely problem here.
  • Redirect stdout (and stderr if you want) to a file. For basic prints, use the file argument:

    print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)
    
  • Use a remote debugger - see How can pyspark be called in debug mode?
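The file-redirect option above can be sketched as follows. This is a minimal illustration, not the asker's actual logic: the log path, file naming, and the pass-through tuple handling are assumptions.

```python
import os
import tempfile

def func(kv_iterator):
    # Hypothetical per-process debug log; any path writable on the
    # executor works. Using the PID keeps concurrent workers separate.
    log_path = os.path.join(tempfile.gettempdir(),
                            'partition_debug_%d.log' % os.getpid())
    with open(log_path, 'a') as log:
        for key, value in kv_iterator:
            # Redirected print: lands in the file instead of the
            # worker's stdout, where it would be lost to the driver.
            print('processing', key, '->', value, file=log)
            yield (key, value)
```

After running someRdd.mapPartitions(func) with an action, the log files can be inspected on the worker machines (or locally in local mode).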

But what is most important: test the function outside Spark. A function used with mapPartitions should accept an Iterable (the concrete implementation is usually itertools.chain) and return an Iterable.
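For example, the partition function can be exercised with a plain iterator, no SparkContext required. The summing body below is just a stand-in for whatever func actually does:

```python
from itertools import chain

def func(kv_iterator):
    # Stand-in partition logic: sum all values seen in this partition.
    total = 0
    for _key, value in kv_iterator:
        total += value
    return [total]

# mapPartitions hands the function an iterator (often itertools.chain);
# emulate that locally to debug with ordinary prints or a debugger.
partition = chain([('a', 1), ('b', 2)], [('c', 3)])
print(func(partition))  # [6]
```

Once the function behaves correctly on a plain iterable, passing it to mapPartitions is a mechanical step.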

Alper t. Turker answered Oct 14 '22 09:10