Submitting pyspark script to a remote Spark server?

This is probably a really silly question, but I can't find the answer with Google. I've written a simple pyspark ETL script that reads in a CSV and writes it to Parquet, something like this:

from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder.getOrCreate()
sqlContext = SQLContext(spark.sparkContext)
df = sqlContext.read.csv(input_filename)   # read the input CSV
df.write.parquet(output_path)              # write it back out as Parquet

To run it, I start up a local Spark cluster in Docker:

$ docker run --network=host jupyter/pyspark-notebook

I run the Python script, it connects to this local Spark cluster, and everything works as expected.

Now I'd like to run the same script on a remote Spark cluster (AWS EMR). Can I just specify a remote IP address somewhere when initialising the Spark context? Or am I misunderstanding how Spark works?

asked Feb 12 '19 by aco

People also ask

How do I submit a PySpark script with spark-submit?

Apache Spark ships with a spark-submit.sh script for Linux and macOS, and a spark-submit.cmd file for Windows. These scripts live in the $SPARK_HOME/bin directory and are used to submit a PySpark file (with a .py extension) to the cluster.
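As a minimal sketch, a basic invocation might look like this (etl.py is a hypothetical name standing in for your own script):

$ $SPARK_HOME/bin/spark-submit etl.py   # etl.py is a placeholder for your own script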

How do I run a Python script with spark-submit?

Pass the .py file you want to run to spark-submit. You can also pass .py, .egg, or .zip files to the spark-submit command with the --py-files option to ship any dependencies.
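For instance, a dependency archive can be shipped alongside the script (deps.zip and etl.py are hypothetical names here):

$ spark-submit --py-files deps.zip etl.py   # deps.zip bundles extra Python modules the job imports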

How do you connect to a Spark cluster from PySpark?

You can use the spark-submit command, installed along with Spark, to submit PySpark code to a cluster from the command line. It takes a PySpark or Scala program and executes it on the cluster.
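As a sketch, pointing the same command at a remote standalone cluster only changes the --master flag (the host below is a placeholder; 7077 is the standalone master's default port):

$ spark-submit --master spark://<master-host>:7077 etl.py   # <master-host> and etl.py are placeholders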


1 Answer

You can create a Spark session that points at a remote standalone master by specifying its IP address and port:

spark = SparkSession.builder.master("spark://<ip>:<port>").getOrCreate()

In the case of AWS EMR, standalone mode is not supported. You need to use YARN in either client or cluster mode, and point HADOOP_CONF_DIR to a directory on your local machine that contains all of the files from /etc/hadoop/conf on the cluster. Then set up dynamic port forwarding to connect to the EMR cluster, and create a Spark session like:

spark = SparkSession.builder.master('yarn').config('spark.submit.deployMode', 'cluster').getOrCreate()

Refer to https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/ for details.
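As a rough sketch of that setup (the EMR master's DNS name, the key file, the forwarding port 8157, and etl.py are all placeholders you substitute with your own values), the local environment could be prepared like this; once HADOOP_CONF_DIR is set and the tunnel is open, you can either create the session as above or hand the script to spark-submit:

$ mkdir -p ~/emr-hadoop-conf
$ scp -i mykey.pem -r hadoop@<emr-master-dns>:/etc/hadoop/conf/* ~/emr-hadoop-conf/   # copy the cluster's Hadoop config locally
$ export HADOOP_CONF_DIR=~/emr-hadoop-conf
$ ssh -i mykey.pem -N -D 8157 hadoop@<emr-master-dns>   # dynamic port forwarding to the EMR cluster
$ spark-submit --master yarn --deploy-mode cluster etl.py   # etl.py is a placeholder for your own script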

answered Nov 15 '22 by HarryClifton