
Low CPU usage while running a Spark job

I am running a Spark job. I have 4 cores and worker memory set to 5G. The application master is on another machine in the same network and does not host any workers. This is my code:

private void myClass() {
    // configuration of the Spark context
    SparkConf conf = new SparkConf()
            .setAppName("myWork")
            .setMaster("spark://myHostIp:7077")
            .set("spark.driver.allowMultipleContexts", "true");
    // creation of the Spark context in which we will run the algorithm
    JavaSparkContext sc = new JavaSparkContext(conf);

    // algorithm
    for (int i = 0; i < 200; i++) {
        System.out.println("===============================================================");
        System.out.println("iteration : " + i);
        System.out.println("===============================================================");
        // build the input: 1900 dummy elements split into 100 partitions
        ArrayList<Boolean> list = new ArrayList<Boolean>();
        for (int j = 0; j < 1900; j++) {
            list.add(true);
        }
        JavaRDD<myObj> ratings = sc.parallelize(list, 100)
                    .map(bool -> new myObj())
                    .map(obj -> this.setupObj(obj))
                    .map(obj -> this.moveObj(obj))
                    .cache();
        int[] stuff = ratings
                    .map(obj -> obj.getStuff())
                    .reduce((obj1, obj2) -> this.mergeStuff(obj1, obj2));
        this.setStuff(stuff);

        ArrayList<TabObj> tabObj = ratings
                    .map(obj -> this.objToTabObjAsTab(obj))
                    .reduce((obj1, obj2) -> this.mergeTabObj(obj1, obj2));
        ratings.unpersist(false);

        this.setTabObj(tabObj);
    }

    sc.close();
}

When I start it, I can see progress on the Spark UI, but it is really slow (I have to set the parallelize partition count quite high, otherwise I run into a timeout issue). I thought it was a CPU bottleneck, but the JVM's CPU consumption is actually very low (most of the time it is 0%, sometimes a bit more than 5%...).

The JVM is using around 3G of memory according to the monitor, with only 19M cached.

The master host has 4 cores, and less memory (4G). That machine shows 100% CPU consumption (a full core) and I don't understand why it is that high... It just has to send partitions to the worker on the other machine, right?

Why is CPU consumption low on the worker, and high on the master?

DeepProblems asked Jul 07 '17 13:07

People also ask

Is Spark memory intensive?

Spark relies heavily on cluster memory (RAM) as it performs parallel computing in memory across nodes to reduce the I/O and execution times of tasks. It is important to configure the Spark application appropriately based on data and processing requirements for it to be successful.
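
For example, executor memory can be set when the application builds its configuration. Below is a minimal sketch; the "4g" value and the app name are illustrative choices, not recommendations:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Minimal sketch: sizing executor memory up front.
    // "4g" is an illustrative value, not a tuned recommendation.
    SparkConf conf = new SparkConf()
            .setAppName("memory-config-example")
            .set("spark.executor.memory", "4g");
    JavaSparkContext sc = new JavaSparkContext(conf);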

Why do Spark jobs fail?

In Spark, stage failures happen when there's a problem with processing a Spark task. These failures can be caused by hardware issues, incorrect Spark configurations, or code problems. When a stage failure occurs, the Spark driver logs report an exception similar to the following: org.

How many executors should I use Spark?

Five executors with 3 cores, or three executors with 5 cores. The consensus in most Spark tuning guides is that 5 cores per executor is the optimal number of cores for parallel processing.
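
Expressed as configuration, that rule of thumb could look like the sketch below; the values are illustrative, and spark.executor.instances is only honored by resource managers such as YARN:

    import org.apache.spark.SparkConf;

    // Illustrative sketch: three executors with five cores each,
    // following the 5-cores-per-executor rule of thumb.
    SparkConf conf = new SparkConf()
            .setAppName("executor-sizing-example")
            .set("spark.executor.instances", "3")  // executor count (YARN)
            .set("spark.executor.cores", "5");     // cores per executor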

What is JVM CPU usage?

A JVM may max out on CPU usage because of the incoming workload. The server capacity may not be sized sufficiently to handle the rate of requests coming in and in such a situation, the Java application may be doing work, trying to keep up with the workload.


1 Answer

  1. Make sure you submit your Spark job to the cluster through YARN or Mesos; otherwise it may run only on your master node (see the spark-submit sketch after this list).

  2. As your code is pretty simple, it should finish the computation very fast, but I suggest using the word-count example to read a few GB of input data and test what the CPU consumption looks like.

  3. For local testing, use "local[*]"; the * means Spark will use all of your cores for computation:

    SparkConf sparkConf = new SparkConf()
            .set("spark.driver.host", "localhost")
            .setAppName("unit-testing")
            .setMaster("local[*]");

    Reference: https://spark.apache.org/docs/latest/configuration.html

  4. In Spark there are a lot of things that can influence CPU and memory usage, such as the number of executors and how much spark.executor.memory you want to give each one; the sketch after this list shows both.
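
As a concrete illustration of points 1 and 4, a cluster submission could look like the sketch below. The resource numbers, class name, and jar are hypothetical placeholders, not values taken from the question:

    # Hypothetical values: adjust executors, cores, and memory to your cluster.
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 3 \
      --executor-cores 5 \
      --executor-memory 4g \
      --class com.example.MyWork \
      myWork.jar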

SharpLu answered Sep 19 '22 16:09