Spark : multiple spark-submit in parallel

I have a general question about Apache Spark:

We have several Spark Streaming scripts that consume Kafka messages. Problem: they fail randomly without a specific error...

Some scripts do nothing (even though they work when I run them manually), and one fails with this message:

ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!

So I'm wondering: is there a specific way to run these scripts in parallel?

They are all in the same jar, and I run them with Supervisor. Spark is installed via Cloudera Manager 5.4 and runs on YARN.

Here is how I launch a script:

sudo -u spark spark-submit --class org.soprism.kafka.connector.reader.TwitterPostsMessageWriter /home/soprism/sparkmigration/data-migration-assembly-1.0.jar --master yarn-cluster --deploy-mode client

Thanks for your help!

Update: I changed the command and now run it as below (it stops with no specific error message):

root@ns6512097:~# sudo -u spark spark-submit --class org.soprism.kafka.connector.reader.TwitterPostsMessageWriter --master yarn --deploy-mode client /home/soprism/sparkmigration/data-migration-assembly-1.0.jar
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/avro-tools-1.7.6-cdh5.4.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/09/28 16:14:21 INFO Remoting: Starting remoting
15/09/28 16:14:21 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:52748]
15/09/28 16:14:21 INFO Remoting: Remoting now listens on addresses: [akka.tcp://[email protected]:52748]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/avro-tools-1.7.6-cdh5.4.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Taoma_k asked Sep 28 '15


People also ask

Can Spark run multiple jobs in parallel?

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save , collect ) and any tasks that need to run to evaluate that action.
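
A minimal Scala sketch of that pattern (the names, sizes, and app name here are illustrative, not taken from the question): two count() actions are wrapped in Futures so that each one is submitted to the same SparkContext from its own thread, letting Spark schedule the two jobs concurrently.

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    object ParallelJobsSketch {
      def main(args: Array[String]): Unit = {
        // The master is supplied by spark-submit (e.g. --master yarn), as in the question.
        val conf = new SparkConf().setAppName("parallel-jobs-sketch")
        val sc = new SparkContext(conf)

        val rddA = sc.parallelize(1 to 1000000)
        val rddB = sc.parallelize(1 to 1000000)

        // Each Future runs on its own thread, so each count() is submitted as a
        // separate Spark job and the two jobs can run at the same time.
        val jobA = Future { rddA.count() }
        val jobB = Future { rddB.count() }

        val total = Await.result(jobA, 10.minutes) + Await.result(jobB, 10.minutes)
        println("total elements counted: " + total)

        sc.stop()
      }
    }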

Can we have multiple jobs in a single executor in Spark?

Hence, at a time, Spark runs multiple tasks in parallel but not multiple jobs. WARNING: this does not mean Spark cannot run concurrent jobs. Through this article we will explore how we can boost a default Spark application's performance by running multiple jobs (Spark actions) at once.

How do I run multiple Spark contexts?

If you really need to work with multiple Spark contexts, you can turn on the special option spark.driver.allowMultipleContexts, but it is used only for Spark internal tests and is not supposed to be used in user programs. You will get unexpected behavior while running more than one Spark context in a single JVM (see SPARK-2243).

Does Spark run in parallel?

Yes. A driver program within the Spark cluster holds the application logic, and the data itself is processed in parallel by multiple workers (executors) across the cluster.

How does Spark run multiple tasks in parallel?

In other words, once a Spark action is invoked, a Spark job comes into existence. The job consists of one or more stages, and these stages are further broken down into numerous tasks that are worked on by the executors in parallel. Hence, at any given time Spark runs multiple tasks in parallel, but not multiple jobs.

What is the use of the spark-submit command?

The spark-submit command is a utility to run or submit a Spark or PySpark application (or job) to the cluster by specifying options and configurations; the application you are submitting can be written in Scala, Java, or Python (PySpark).

How to use the parallelize() method in Spark?

In order to use the parallelize() method, the first thing that has to be created is a SparkContext object. It can be created in the following way: 1. import the required classes, 2. create a SparkConf object (master and appName are the minimum properties that have to be set in order to run a Spark application), and 3. create the SparkContext from that configuration.
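
A short Scala sketch of those steps (the master URL and app name are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object ParallelizeSketch {
      def main(args: Array[String]): Unit = {
        // 1. Import the classes above.
        // 2. Create the SparkConf: master and appName are the minimum properties.
        val conf = new SparkConf().setMaster("local[*]").setAppName("parallelize-sketch")
        // 3. Create the SparkContext from that configuration.
        val sc = new SparkContext(conf)

        // parallelize() distributes a local collection as an RDD.
        val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
        println("sum = " + rdd.sum())

        sc.stop()
      }
    }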


1 Answer

This issue occurs if multiple users try to start a Spark session at the same time, or if existing Spark sessions were not properly closed.

There are two ways to fix this issue.

  • Start the new Spark session on a different port, as follows (see also the SparkConf sketch after the kill example below for setting the port from application code):

    spark-submit --conf spark.ui.port=5051 <other arguments>
    spark-shell --conf spark.ui.port=5051
    
  • Find all Spark sessions using ports from 4041 to 4056 and kill those processes. The netstat and kill commands can be used to find the process occupying a port and to kill it, respectively. Here's the usage:

    sudo netstat -tunalp | grep LISTEN | grep 4041
    

The above command will produce output like the example below; the last column is the process ID, in this case 32028:

tcp        0      0 :::4040    :::*         LISTEN      32028/java

Once you have found the process ID (PID), you can kill the Spark process (spark-shell or spark-submit) using the command below:

sudo kill -9 32028
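
If you launch the job from code rather than only from the command line, the same UI port setting can be applied on the SparkConf. This is only an illustrative sketch (the app name is hypothetical), not something from the answer above:

    import org.apache.spark.{SparkConf, SparkContext}

    // Same effect as passing --conf spark.ui.port=5051 to spark-submit.
    val conf = new SparkConf()
      .setAppName("twitter-posts-writer")   // hypothetical application name
      .set("spark.ui.port", "5051")         // bind the Spark UI to a known free port
      // .set("spark.ui.enabled", "false")  // alternatively, disable the UI if it is not needed

    val sc = new SparkContext(conf)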
SachinJ answered Oct 19 '22