I installed Spark 3.x on my system and am submitting a spark-submit command in YARN cluster mode from my Unix terminal, but I am getting the error below.
(base) vijee@vijee-Lenovo-IdeaPad-S510p:~/spark-3.0.1-bin-hadoop2.7$ bin/spark-submit --master yarn --deploy-mode cluster --py-files /home/vijee/Python/PythonScriptOnYARN.py
which throws the following error:
Error: Missing application resource
Here are the full details of my attempt:
(base) vijee@vijee-Lenovo-IdeaPad-S510p:~/spark-3.0.1-bin-hadoop2.7$ bin/spark-submit --master yarn --deploy-mode "cluster" --driver-memory 1G --executor-cores 2 --num-executors 1 --executor-memory 2g --py-files /home/vijee/Python/PythonScriptOnYARN.py
Error: Missing application resource.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/vijee/spark-3.0.1-bin-hadoop2.7/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/vijee/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn,
k8s://https://host:port, or local (Default: local[*]).
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories are given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
.
.
.
Below are my configuration settings:
cat spark-env.sh
export HADOOP_HOME=/home/vijee/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
export SPARK_DIST_CLASSPATH=/home/vijee/hadoop-2.7.7/etc/hadoop:/home/vijee/hadoop-2.7.7/share/hadoop/common/lib/*:/home/vijee/hadoop-2.7.7/share/hadoop/common/*:/home/vijee/hadoop-2.7.7/share/hadoop/hdfs:/home/vijee/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/home/vijee/hadoop-2.7.7/share/hadoop/hdfs/*:/home/vijee/hadoop-2.7.7/share/hadoop/yarn:/home/vijee/hadoop-2.7.7/share/hadoop/yarn/lib/*:/home/vijee/hadoop-2.7.7/share/hadoop/yarn/*:/home/vijee/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/home/vijee/hadoop-2.7.7/share/hadoop/mapreduce/*:/home/vijee/hadoop-2.7.7/contrib/capacity-scheduler/*.jar
cat .bashrc
export HADOOP_HOME=/home/vijee/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/home/vijee/spark-3.0.1-bin-hadoop2.7
export PATH="$PATH:/home/vijee/spark-3.0.1-bin-hadoop2.7/bin"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
export PYSPARK_PYTHON=/home/vijee/anaconda3/bin/python3
export PYSPARK_DRIVER_PYTHON=/home/vijee/anaconda3/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
cat spark-defaults.conf
spark.master yarn
spark.driver.memory 512m
spark.yarn.am.memory 512m
spark.executor.memory 512m
# Configure history server
spark.eventLog.enabled true
spark.eventLog.dir hdfs://localhost:9000/sparklogs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs://localhost:9000/sparklogs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
spark.yarn.security.tokens.hive.enabled true
cat PythonScriptOnYARN.py
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import Row

# Build a SparkSession that runs on YARN with Hive support enabled
SS = SparkSession.builder.master("yarn").appName("ProjectYARN").enableHiveSupport().getOrCreate()
sc = SS.sparkContext

# Read the CSV from the local filesystem, drop the header row, split each line,
# and turn the fields into Row objects
rdd1 = sc.textFile("file:///home/vijee/Python/car_data.csv")
rdd2 = rdd1.filter(lambda b: "id" not in b)
rdd3 = rdd2.map(lambda a: a.split(","))
rdd4 = rdd3.map(lambda c: Row(id=int(c[0]), CarBrand=c[1], Price=int(c[2]), Caryear=int(c[3]), Color=c[4]))

# Convert to a DataFrame and query it with Spark SQL
df11 = SS.createDataFrame(rdd4)
df11.createOrReplaceTempView("table1")
SS.sql("select * from table1 where CarBrand='GMC'").show()
SS.stop()
Can somebody tell me where I am making a mistake and how to solve this issue?
Remove --py-files. That option is for adding extra Python modules or dependencies, not for specifying the script you are going to run; the main script must be passed at the end of the command as the positional application resource, which is exactly what the "Missing application resource" error is complaining about.
bin/spark-submit --master yarn --deploy-mode cluster /home/vijee/Python/PythonScriptOnYARN.py
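For completeness, a sketch of the same submission with the resource options from the original attempt. The helper module name utils.py is purely hypothetical and only illustrates what --py-files is actually meant to ship; the main script stays at the end as the application resource:

bin/spark-submit --master yarn --deploy-mode cluster \
  --driver-memory 1G --executor-cores 2 --num-executors 1 --executor-memory 2g \
  --py-files /home/vijee/Python/utils.py \
  /home/vijee/Python/PythonScriptOnYARN.py

Anything passed through --py-files is placed on the PYTHONPATH of the driver and executors, so the main script can simply import it.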