I'm trying to run pip install on all the slave machines of a running EMR cluster. How can I do that?
I can't do it with a bootstrap action because it's a long-running cluster that I can't take down.
The cluster is running Spark and YARN, so I would normally use Spark's slaves.sh, but I can't find that script on the master node. Is it installed somewhere I haven't looked? Or is there some way to install it?
I've seen other questions that suggest using the YARN distributed shell, but I can't find a working example of how to do that.
BTW, the cluster is running EMR 4.8.0 with Spark 1.6.1, I believe.
One option is to submit a custom JAR step to run a script or command on Amazon EMR. When you use command-runner.jar, you specify the commands, options, and values in your step's list of arguments, and the AWS CLI can submit such a step to a running cluster.
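A minimal sketch of such a step (the cluster ID j-XXXXXXXXXXXXX and the package name are placeholders) might look like the following; note that command-runner.jar steps execute on the master node only, so this by itself won't install anything on the slaves:

# Submit a step to a running cluster that shells out to pip on the master.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=CUSTOM_JAR,Name="pip install",ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=["bash","-c","sudo pip install package"]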
To reach every node, you can run the yarn command on the master node to get the list of all nodes, then use SSH to run the command on each of them. As in the article mentioned before, you can run something like:
# Copy the cluster's SSH key (e.g. ssh_key.pem) to the master node.
aws s3 cp s3://bucket/ssh_key.pem ~/

# Restrict the key's permissions so ssh will accept it.
chmod 400 ~/ssh_key.pem

# Extract the worker hostnames from `yarn node -list` and run pip on each,
# up to 10 hosts in parallel (use "sudo pip install" if you need a
# system-wide install).
yarn node -list | sed -n 's/^\(ip[^:]*\):.*/\1/p' | xargs -t -I{} -P10 ssh -o StrictHostKeyChecking=no -i ~/ssh_key.pem hadoop@{} "pip install package"
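To confirm the install landed on every node, a quick hypothetical check (package is again a placeholder) reuses the same pipeline:

# Print each node's installed-package metadata; missing packages show an error.
yarn node -list | sed -n 's/^\(ip[^:]*\):.*/\1/p' | xargs -I{} ssh -o StrictHostKeyChecking=no -i ~/ssh_key.pem hadoop@{} "pip show package"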