Submitting pyspark script to a remote Spark server?

This is probably a really silly question, but I can't find the answer with Google. I've written a simple pyspark ETL script that reads in a CSV and writes it to Parquet, something like this:

from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder.getOrCreate()
sqlContext = SQLContext(spark.sparkContext)
df = sqlContext.read.csv(input_filename)   # read the input CSV
df.write.parquet(output_path)              # write it back out as Parquet

To run it, I start up a local Spark cluster in Docker:

$ docker run --network=host jupyter/pyspark-notebook

I run the Python script, it connects to this local Spark cluster, and everything works as expected.

Now I'd like to run the same script on a remote Spark cluster (AWS EMR). Can I just specify a remote IP address somewhere when initialising the Spark context? Or am I misunderstanding how Spark works?

asked Feb 12 '19 by aco

People also ask

How do I submit a PySpark script with spark-submit?

Apache Spark ships with a spark-submit.sh script for Linux and macOS, and a spark-submit.cmd file for Windows. These scripts live in the $SPARK_HOME/bin directory and are used to submit a PySpark file (with a .py extension) to the cluster.
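As a minimal sketch, a basic invocation might look like this (etl.py is a hypothetical name standing in for your own script):

$ $SPARK_HOME/bin/spark-submit etl.py   # etl.py is a placeholder for your own script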

How do I run a Python script with spark-submit?

Pass the .py file you want to run to spark-submit. You can also pass .py, .egg, or .zip files to the spark-submit command with the --py-files option to ship any dependencies.
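For instance, a dependency archive can be shipped alongside the script (deps.zip and etl.py are hypothetical names here):

$ spark-submit --py-files deps.zip etl.py   # deps.zip bundles extra Python modules the job imports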

How do you connect to a Spark cluster from PySpark?

You can use the spark-submit command, installed along with Spark, to submit PySpark code to a cluster from the command line. It takes a PySpark or Scala program and executes it on the cluster.
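As a sketch, pointing the same command at a remote standalone cluster only changes the --master flag (the host below is a placeholder; 7077 is the standalone master's default port):

$ spark-submit --master spark://<master-host>:7077 etl.py   # <master-host> and etl.py are placeholders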


1 Answer

You can create a Spark session that points at a remote standalone master by specifying its IP address and port:

spark = SparkSession.builder.master("spark://<ip>:<port>").getOrCreate()

In the case of AWS EMR, standalone mode is not supported. You need to use YARN in either client or cluster mode, and point HADOOP_CONF_DIR to a directory on your local machine that contains all of the files from /etc/hadoop/conf on the cluster. Then set up dynamic port forwarding to connect to the EMR cluster, and create a Spark session like:

spark = SparkSession.builder.master('yarn').config('spark.submit.deployMode', 'cluster').getOrCreate()

Refer to https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/ for details.
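As a rough sketch of that setup (the EMR master's DNS name, the key file, the forwarding port 8157, and etl.py are all placeholders you substitute with your own values), the local environment could be prepared like this; once HADOOP_CONF_DIR is set and the tunnel is open, you can either create the session as above or hand the script to spark-submit:

$ mkdir -p ~/emr-hadoop-conf
$ scp -i mykey.pem -r hadoop@<emr-master-dns>:/etc/hadoop/conf/* ~/emr-hadoop-conf/   # copy the cluster's Hadoop config locally
$ export HADOOP_CONF_DIR=~/emr-hadoop-conf
$ ssh -i mykey.pem -N -D 8157 hadoop@<emr-master-dns>   # dynamic port forwarding to the EMR cluster
$ spark-submit --master yarn --deploy-mode cluster etl.py   # etl.py is a placeholder for your own script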

answered Nov 15 '22 by HarryClifton