
Difference between Apache Spark mllib.linalg vectors and spark.util vectors for machine learning

I'm trying to implement neural networks in Spark and Scala, but I'm unable to perform any vector or matrix multiplication. Spark provides two vector types: spark.util.Vector supports the dot operation but is deprecated, and mllib.linalg vectors do not support arithmetic operations in Scala.

Which one should I use to store weights and training data?

How do I perform vector multiplication in Spark with Scala and mllib, such as w * x, where w is a vector or matrix of weights and x is the input? The PySpark vector supports the dot product, but in Scala I can't find such a function on the vectors.

asked Jan 20 '16 by gaurav.rai

People also ask

What is the difference between Spark ML and Spark MLlib?

At first glance, the most obvious difference between MLlib and ML is the data types they work on: MLlib supports RDDs, while ML supports DataFrames and Datasets.
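For illustration, a minimal sketch of the two entry points (assuming an existing SparkSession named spark; the column names are made up):

import org.apache.spark.mllib.linalg.Vectors          // RDD-based MLlib API
import org.apache.spark.ml.feature.VectorAssembler    // DataFrame-based ML API

// MLlib operates on RDDs of mllib.linalg.Vector.
val rddData = spark.sparkContext.parallelize(Seq(Vectors.dense(1.0, 2.0)))

// ML operates on DataFrames; VectorAssembler builds a vector column.
val df = spark.createDataFrame(Seq((1.0, 2.0))).toDF("f1", "f2")
val features = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")
  .transform(df)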

Can Apache Spark be used for machine learning?

Apache Spark is known as a fast, easy-to-use, general-purpose engine for big data processing, with built-in modules for streaming, SQL, machine learning (ML), and graph processing.

What is spark MLlib used for?

Spark MLlib is used to perform machine learning in Apache Spark. MLlib is a scalable machine learning library consisting of popular algorithms and utilities, designed to provide both high-quality algorithms and high speed.


1 Answer

Well, if you need full support for linear algebra operators, you have to either implement these yourself or use an external library. In the latter case the obvious choice is Breeze.

It is already used by Spark behind the scenes, so it doesn't introduce additional dependencies, and you can easily adapt existing Spark code for the conversions:

import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}

// Convert a Spark mllib vector to its Breeze counterpart.
def toBreeze(v: Vector): BV[Double] = v match {
  case DenseVector(values) => new BDV[Double](values)
  case SparseVector(size, indices, values) =>
    new BSV[Double](indices, values, size)
}

// Convert a Breeze vector back to a Spark mllib vector.
def toSpark(v: BV[Double]): Vector = v match {
  case v: BDV[Double] => new DenseVector(v.toArray)
  case v: BSV[Double] => new SparseVector(v.length, v.index, v.data)
}
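With these helpers in place, the w * x dot product from the question becomes straightforward. A minimal sketch (the w and x values here are made up for illustration):

import org.apache.spark.mllib.linalg.Vectors

// Hypothetical weight vector w and input vector x.
val w = Vectors.dense(0.5, -1.0, 2.0)
val x = Vectors.dense(1.0, 0.0, 3.0)

// w * x as a scalar, computed via Breeze's dot.
val wx: Double = toBreeze(w) dot toBreeze(x)

// Any Breeze result can be converted back to a Spark vector.
val scaled = toSpark(toBreeze(x) * 2.0)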

Mahout provides Spark and Scala bindings you may find interesting as well.

For simple matrix-vector multiplications it can be easier to leverage the existing distributed matrix methods. For example, IndexedRowMatrix and RowMatrix provide multiply methods that accept a local matrix. You can check Matrix Multiplication in Apache Spark for an example usage.
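As a rough sketch of that approach (assuming an existing SparkContext sc; the matrix values are made up):

import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Distributed data matrix X, one input row per record.
val X = new RowMatrix(sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0)
)))

// Local weight matrix W (2 x 1, column-major), i.e. a single column of weights.
val W = Matrices.dense(2, 1, Array(0.5, -1.0))

// X.multiply(W) computes X * W and returns another distributed RowMatrix.
val XW = X.multiply(W)
XW.rows.collect().foreach(println)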

answered by zero323