How can I deal with the problem that I can't debug my code with print statements inside the function that I pass to mapPartitions() in PySpark?
Consider the example:
def func(kv_iterator):
    for key, value in kv_iterator:
        # do fancy stuff
        print('This print statement does not reach the driver program')
    return [result]

result = someRdd.mapPartitions(func)
Inside func I'd like to do lots of work with iterables and indexing, but I can't test my code without having access to the variables inside func.
Is it possible to somehow redirect the print statements from, let's say, one partition to my driver program / output channel?
You can use one of the following:

- local mode. All output should be visible in the console. If it is not, your code is probably never executed - try result.count(), result.foreach(lambda _: None), or another action - this is probably the problem here.
- Redirect stdout (and stderr if you want) to a file. For basic prints use the file argument:
  print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)
- Use a remote debugger - see How can pyspark be called in debug mode?
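A minimal sketch of the file-redirection idea: since each partition runs on an executor whose stdout never reaches the driver, the function can write its debug output to a file on the worker's local disk instead. The log location and file name below are assumptions - use any path the executors can write to:

```python
import os
import tempfile

def func(kv_iterator):
    # One log file per worker process; tempfile.gettempdir() and the
    # file name are assumptions, not anything Spark requires.
    log_path = os.path.join(tempfile.gettempdir(),
                            'partition_debug_%d.log' % os.getpid())
    results = []
    with open(log_path, 'a') as log:
        for key, value in kv_iterator:
            # file= sends the print to the log instead of the lost stdout
            print('processing', key, value, file=log, flush=True)
            results.append((key, value))
    return results

# result = someRdd.mapPartitions(func)
# then inspect the log files on each worker node (or on localhost in local mode)
```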
But most importantly - test the function outside Spark. A function used with mapPartitions should accept an Iterable (the concrete implementation is usually itertools.chain) and return an Iterable.
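Because mapPartitions simply calls your function with an iterator over each partition's elements, you can exercise it with a plain Python iterable and debug it with ordinary prints, no Spark required. A sketch, where sum_values is a made-up example partition function:

```python
from itertools import chain

def sum_values(kv_iterator):
    # Example partition function: consumes (key, value) pairs and
    # yields a single (count, total) pair per partition.
    count = total = 0
    for key, value in kv_iterator:
        count += 1
        total += value
    yield (count, total)

# Call it exactly as Spark would: with an iterator over the partition.
print(list(sum_values(iter([('a', 1), ('b', 2)]))))     # -> [(2, 3)]

# itertools.chain is the concrete iterator type Spark typically passes in.
print(list(sum_values(chain([('a', 1)], [('b', 2)]))))  # -> [(2, 3)]
```

Once the function behaves correctly on a plain iterator, passing it to someRdd.mapPartitions(sum_values) should behave the same per partition.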