 

Why is this simple Spark program not utilizing multiple cores?

So, I'm running this simple program on a 16-core multicore system. I run it by issuing the following:

spark-submit --master local[*] pi.py

And the code of that program is the following.

#"""pi.py"""
from pyspark import SparkContext
import random

N = 12500000

def sample(p):
    x, y = random.random(), random.random()
    return 1 if x*x + y*y < 1 else 0

sc = SparkContext("local", "Test App")
count = sc.parallelize(xrange(0, N)).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)

When I use top to see CPU consumption, only 1 core is being utilized. Why is that? Secondly, the Spark documentation says that the default parallelism is contained in the property spark.default.parallelism. How can I read this property from within my Python program?

asked Nov 09 '14 by MetallicPriest



2 Answers

As none of the above really worked for me (maybe because I didn't really understand them), here is my two cents.

I was starting my job with spark-submit program.py, and inside the file I had sc = SparkContext("local", "Test"). I tried to verify the number of cores Spark sees with sc.defaultParallelism; it turned out to be 1. When I changed the context initialization to sc = SparkContext("local[*]", "Test") it became 16 (the number of cores of my system) and my program was using all the cores.

I am quite new to Spark, but my understanding is that local by default means a single core, and since it is set inside the program, it overrides the other settings (in my case it certainly overrode those from configuration files and environment variables).
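Here is a minimal sketch of both points (my own addition; it assumes sc.getConf() is available in your PySpark version, and that spark.default.parallelism may simply be unset unless you pass it explicitly):

from pyspark import SparkContext

# "local[*]" asks Spark to use as many worker threads as there are cores
sc = SparkContext("local[*]", "Test App")

# Default number of partitions for operations like parallelize
print(sc.defaultParallelism)

# The same setting read as a configuration property; falls back if it was never set
print(sc.getConf().get("spark.default.parallelism", "not explicitly set"))

This also covers the second part of the question: any configuration property can be read through sc.getConf().get(...).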

answered Sep 20 '22 by Ivaylo Petrov

Probably because the call to sc.parallelize puts all the data into a single partition. You can specify the number of partitions as the second argument to parallelize:

part = 16
count = sc.parallelize(xrange(N), part).map(sample).reduce(lambda a, b: a + b)

Note that this would still generate the 12.5 million points with one CPU in the driver and only then spread them out over 16 partitions to perform the map and reduce steps.
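If you want to confirm how the data is split, you can check the partition count of the RDD (a quick sketch, not from the original answer; getNumPartitions is a standard PySpark RDD method):

part = 16
rdd = sc.parallelize(xrange(N), part)

# Should print 16 once the partition count is passed explicitly
print(rdd.getNumPartitions())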

A better approach is to do most of the work after the partitioning: for example, the following generates only a tiny array on the driver and then lets each remote task generate the actual random numbers and the subsequent pi approximation:

part = 16
count = ( sc.parallelize([0] * part, part)
.flatMap(lambda blah: [sample(p) for p in xrange(N / part)])
           .reduce(lambda a, b: a + b)
       )

Finally (because the lazier we are, the better), Spark MLlib already comes with random data generation that is nicely parallelized; have a look here: http://spark.apache.org/docs/1.1.0/mllib-statistics.html#random-data-generation. So maybe the following is close to what you are trying to do (not tested => probably not working, but it should hopefully be close):

from pyspark.mllib.random import RandomRDDs

count = ( RandomRDDs.uniformRDD(sc, N, part)
          .zip(RandomRDDs.uniformRDD(sc, N, part))
          .filter(lambda (x, y): x*x + y*y < 1)
          .count()
        )
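As a small follow-up (my addition, not part of the original answer): whichever variant you use, the count of points that fall inside the quarter circle turns into the pi estimate exactly as in the question:

# The hit ratio approximates the area of a quarter unit circle, i.e. pi/4
print("Pi is roughly %f" % (4.0 * count / N))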
answered Sep 21 '22 by Svend