
How to execute arbitrary python code on spark cluster distributed to workers

I am trying to run a simulation in Python on a Spark cluster. It takes the form of two steps:

  1. Execute (in parallel over a number of worker nodes) a set of python functions. The results of these are to be written out as text files

  2. Collect the results. This will take place using pyspark dataframes.

Is it possible to instruct Spark to execute ordinary Python code on the worker nodes of a Spark cluster to achieve this first step? When I try using spark-submit, only commands that go through the Spark context are executed on the cluster. The rest of the Python code is executed on the local machine, which I do not want.

This answer seems to say no: Using regular python code on a Spark cluster, but it is not terribly specific.

Example for Clarification

To give an example of step 1, I have a script called draw_from_uniform_distribution.py that does the following:

import sys
import numpy

the_output_file = sys.argv[1]  # get output file from command line
the_number = numpy.random.uniform(size=1)
with open(the_output_file, 'w') as f_out:
    print(the_number, file=f_out)

I want to run this script 1000 times in parallel on the Spark cluster. How do I do so?

asked Mar 08 '26 17:03 by Josh

1 Answer

You can take a look at how this is done by the Spark backend of joblib (https://github.com/joblib/joblib-spark).

The relevant piece of code is the following:

from pyspark.sql import SparkSession
from pyspark import cloudpickle
...

spark = SparkSession.builder.getOrCreate()
serialized_result = (
    spark.sparkContext.parallelize([0], 1)
    .map(lambda _: cloudpickle.dumps(your_function()))
    .first()
)
result = cloudpickle.loads(serialized_result)

The function to be run is shipped to a worker inside the pickled closure of the map() operation, applied to a "dummy" RDD (one element in one partition); it executes on the worker, and its result is serialized with cloudpickle and returned to the driver.
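Applying the same pattern to the question's example, here is a minimal sketch of running 1000 independent draws as plain Python on the workers. The function name `draw_one` and the `/tmp/draw_<id>.txt` output paths are hypothetical choices for illustration, and it assumes the workers can write to a local (or shared) filesystem:

```python
import numpy


def draw_one(task_id):
    # Runs as ordinary Python on a worker: draw one sample from a
    # uniform distribution and write it to a per-task text file
    # (the /tmp path here is a hypothetical example).
    value = float(numpy.random.uniform(size=1)[0])
    with open(f"/tmp/draw_{task_id}.txt", "w") as f_out:
        print(value, file=f_out)
    return value


if __name__ == "__main__":
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # 1000 elements in 1000 partitions => up to 1000 parallel tasks,
    # each calling draw_one() on whichever worker runs the task.
    results = (
        spark.sparkContext.parallelize(range(1000), 1000)
        .map(draw_one)
        .collect()
    )
    print(len(results))
```

Because `draw_one` is a plain top-level function, Spark pickles it and runs it inside each map task, which is exactly the "ordinary Python on worker nodes" behavior the question asks for; step 2 can then read the output files (or just use the collected values) with pyspark dataframes.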

answered Mar 11 '26 05:03 by Joachim


