Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Gaussian Mixture Models: Difference between Spark MLlib and scikit-learn

Tags:

I'm trying to use Gaussian Mixture models on a sample of a dataset. I used bothMLlib (with pyspark) and scikit-learn and get very different results, the scikit-learn one looking more realistic.

from pyspark.mllib.clustering import GaussianMixture as SparkGaussianMixture
from sklearn.mixture import GaussianMixture
from pyspark.mllib.linalg import Vectors

Scikit-learn:

local = pd.DataFrame([ x.asDict() for x in df.sample(0.0001).collect() ])
model1 = GaussianMixture(n_components=3)
model1.fit([ [x] for x in local['field'].tolist() ])

model1.means_
array([[7.56123598e+00],
   [1.32517410e+07],
   [3.96762639e+04]])

model1.covariances_
array([[[6.65177423e+00]],
   [[1.00000000e-06]],
   [[8.38380897e+10]]])

MLLib:

model2 = SparkGaussianMixture.train(
    sc.createDataFrame(local).rdd.map(lambda x: Vectors.dense(x.field)),
    k=3,
    convergenceTol=1e-4,
    maxIterations=100
)

model2.gaussians
[MultivariateGaussian(mu=DenseVector([28736.5113]), sigma=DenseMatrix(1, 1, [1094083795.0001], 0)),
 MultivariateGaussian(mu=DenseVector([7839059.9208]), sigma=DenseMatrix(1, 1, [38775218707109.83], 0)),
 MultivariateGaussian(mu=DenseVector([43.8723]), sigma=DenseMatrix(1, 1, [608204.4711], 0))]

However, I'm interested in running the entire dataset through the model which I'm afraid would require parallelizing (and hence use MLlib) to get results in finite time. Am I doing anything wrong / missing something?

Data:

The complete data has an extremely long tail and looks like: enter image description here

whereas the data has a clearly normal dist ceneterd somewhere closer to one clustered by scikit-learn:

enter image description here

I am using Spark 2.3.0 (AWS EMR).

Edit: Initialization params:

local = pd.DataFrame([ x.asDict() for x in df.sample(0.0001).collect() ])
model1 = GaussianMixture(n_components=3, init_params='random')
model1.fit([ [x] for x in local['field'].tolist() ])

model1.means_
array([[2.17611913e+04],
   [8.03184505e+06],
   [7.56871801e+00]])

model1.covariances_
rray([[[1.01835902e+09]],
   [[3.98552130e+13]],
   [[6.95161493e+00]]])
like image 548
ixaxaar Avatar asked Jun 18 '18 18:06

ixaxaar


People also ask

Is spark MLlib deprecated?

Is MLlib deprecated? No. MLlib includes both the RDD-based API and the DataFrame-based API. The RDD-based API is now in maintenance mode.

What is MLlib in Apache spark?

MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as: ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering.

What is GMM in machine learning?

Gaussian mixture models (GMMs) are a type of machine learning algorithm. They are used to classify data into different categories based on the probability distribution. Gaussian mixture models can be used in many different areas, including finance, marketing and so much more!

What is GMM classification?

Abstract: Gaussian Mixture Model (GMM) is a probabilistic model for representing normally distributed subpopulations within an overall population. It is usually used for unsupervised learning to learn the subpopulations and the subpopulation assignment automatically.


1 Answers

This isn't a python problem, per se. It seems to be more of a machine learning/data validation/data segmentation question, IMO. That being said, you are correct in thinking that you must parallelize your work, but it matters in what ways you do it. There are things like 8-bit quantization and model parallelism in your model that you may look into to help you get at what you're after: training a model on large datasets, in a timely manner, without sacrificing data quality or fidelity.

Here is a blog post about quantization: https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/

Here is a blog post about model parallelism and 8-bit quantization from Tim Dettmers' Blog: http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/

and the associated paper: https://arxiv.org/pdf/1511.04561.pdf

Though you will want to keep in mind that, depending on FP operations on your GPU, you may not see substantial benefit from this route: https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664

HTH and YMMV.

Also, you may want to look into data folding, but can't remember the details nor the paper I read at this point in time. I'll land this here to remember once I do though.

like image 160
tjborromeo Avatar answered Sep 29 '22 21:09

tjborromeo