 

Is Spark's KMeans unable to handle big data?

KMeans has several parameters for its training, with the initialization mode defaulting to kmeans||. The problem is that it marches quickly (in less than 10 minutes) through the first 13 stages, but then hangs completely, without yielding an error!

Minimal example that reproduces the issue (it will succeed if I use 1000 points or random initialization):

from pyspark.context import SparkContext

from pyspark.mllib.clustering import KMeans
from pyspark.mllib.random import RandomRDDs


if __name__ == "__main__":
    sc = SparkContext(appName='kmeansMinimalExample')

    # 10,000,000 random 64-dimensional points (same behavior with 10,000)
    data = RandomRDDs.uniformVectorRDD(sc, 10000000, 64)
    C = KMeans.train(data, 8192, maxIterations=10)

    sc.stop()

The job does nothing (it doesn't succeed, fail, or progress). There are no active/failed tasks in the Executors tab, and the stdout and stderr logs don't contain anything particularly interesting.


If I use k=81 instead of 8192, it will succeed.


Notice that the two calls to takeSample() should not be the issue, since they were also called twice in the random-initialization case.
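
For completeness, here is the variant with random initialization that does finish on the same data (a minimal sketch; only the initializationMode argument of KMeans.train changes):

# Identical data and k; only the initialization mode differs.
# With "random" init the job completes, as noted above.
C = KMeans.train(data, 8192, maxIterations=10,
                 initializationMode="random")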

So, what is happening? Is Spark's KMeans unable to scale? Does anybody know? Can you reproduce it?


If it were a memory issue, I would be getting warnings and errors, as I had before.

Note: placeybordeaux's comments are based on executing the job in client mode, where the driver's configuration is invalidated, causing exit code 143 and the like (see the edit history), not in cluster mode, where no error is reported at all; the application just hangs.


From zero323: Why is Spark MLlib KMeans algorithm extremely slow? is related, but I think he observes some progress, while mine hangs; I did leave a comment...


asked Sep 01 '16 by gsamaras

People also ask

Can K-means clustering handle big data?

The clustering of datasets has become a challenging issue in the field of big data analytics. The K-means algorithm is best suited for finding similarities between entities based on distance measures with small datasets. Existing clustering algorithms require scalable solutions to manage large datasets.

Why is K-means not recommended?

k-means has trouble clustering data where clusters are of varying sizes and density. To cluster such data, you need to generalize k-means as described in the Advantages section. Clustering outliers. Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored.

Is K-means robust to outliers?

The K-means clustering algorithm is sensitive to outliers, because a mean is easily influenced by extreme values. K-medoids clustering is a variant of K-means that is more robust to noise and outliers.


1 Answer

I think the 'hanging' is because your executors keep dying. As I mentioned in a side conversation, this code runs fine for me, locally and on a cluster, in PySpark and Scala. However, it takes a lot longer than it should. Almost all of the time is spent in k-means|| initialization.

I opened https://issues.apache.org/jira/browse/SPARK-17389 to track two main improvements, one of which you can use now. Edit: actually, see also https://issues.apache.org/jira/browse/SPARK-11560

First, there are some code optimizations that would speed up the init by about 13%.

However, most of the issue is that it defaults to 5 steps of k-means|| init, when it seems that 2 is almost always just as good. You can set initializationSteps to 2 to see a speedup, especially in the stage that's hanging now.
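
Concretely, applied to the minimal example above, that looks like this (a sketch; initializationSteps is the relevant pyspark.mllib.clustering.KMeans.train parameter):

# Lower k-means|| init from the default 5 steps to 2;
# this targets the initialization stage that appears to hang.
C = KMeans.train(data, 8192, maxIterations=10,
                 initializationSteps=2)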

In my (smaller) test on my laptop, init time went from 5:54 to 1:41 with both changes, mostly due to setting init steps.

answered Sep 23 '22 by Sean Owen