
Concatenate Sparse Vectors in Spark?

Say you have two Sparse Vectors. As an example:

val vec1 = Vectors.sparse(2, Array(0), Array(1.0)) // [1, 0]
val vec2 = Vectors.sparse(2, Array(1), Array(1.0)) // [0, 1]

I want to concatenate these two vectors so that the result is equivalent to:

val vec3 = Vectors.sparse(4, Array(0, 3), Array(1.0, 1.0)) // [1, 0, 0, 1]

Does Spark have any such convenience method to do this?

rocket_raccoon asked Dec 04 '15 21:12



2 Answers

If you have the data in a DataFrame, then VectorAssembler would be the right thing to use. For example:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors

# `spark` is assumed to be an existing SparkSession
dataset = spark.createDataFrame(
    [(0,
      Vectors.sparse(10, {0: 0.6931, 5: 0.0, 7: 0.5754, 9: 0.2877}),
      Vectors.sparse(10, {3: 0.2877, 4: 0.6931, 5: 0.0, 6: 0.6931, 8: 0.6931}))],
    ["label", "userFeatures1", "userFeatures2"])

assembler = VectorAssembler(
    inputCols=["userFeatures1", "userFeatures2"],
    outputCol="features")

output = assembler.transform(dataset)
output.select("features", "label").show(truncate=False)

You would get the following output for this:

+---------------------------------------------------------------------------+-----+
|features                                                                   |label|
+---------------------------------------------------------------------------+-----+
|(20,[0,7,9,13,14,16,18],[0.6931,0.5754,0.2877,0.2877,0.6931,0.6931,0.6931])|0    |
+---------------------------------------------------------------------------+-----+
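
The indices in that output can be checked by hand: the second vector's indices are shifted by the size of the first vector (10), and the entries stored explicitly as 0.0 at index 5 do not show up. Below is a minimal plain-Python sketch of that arithmetic. It needs no Spark, and `assemble` is a hypothetical helper written for illustration, not a PySpark function.

```python
def assemble(size1, entries1, size2, entries2):
    """Concatenate two sparse vectors given as {index: value} dicts.

    Hypothetical helper that mimics the output shown above; not a PySpark API.
    """
    merged = {}
    for index, value in entries1.items():
        if value != 0.0:  # explicit zeros are absent from the output above
            merged[index] = value
    for index, value in entries2.items():
        if value != 0.0:
            merged[index + size1] = value  # shift by the first vector's size
    indices = sorted(merged)
    return size1 + size2, indices, [merged[i] for i in indices]


size, indices, values = assemble(
    10, {0: 0.6931, 5: 0.0, 7: 0.5754, 9: 0.2877},
    10, {3: 0.2877, 4: 0.6931, 5: 0.0, 6: 0.6931, 8: 0.6931},
)
print(size, indices, values)
```

The printed size is 20 and the indices are 0, 7, 9, 13, 14, 16, 18, matching the `features` column.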
rakeb answered Oct 17 '22 11:10


I think there is a slight misunderstanding of SparseVectors here, so let me explain them briefly. The first argument is the number of features (columns, dimensions) of the data. Each entry of the second argument is the position of a non-zero feature, and the entry at the same position in the third argument is the value for that column. SparseVectors are therefore position-sensitive, and from my point of view your approach is incorrect.
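
To make the three arguments concrete, here is a small plain-Python sketch that expands a (size, indices, values) triple into the dense list it stands for. `to_dense` is a hypothetical name used only for illustration; it is not part of Spark.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size          # every position defaults to zero
    for index, value in zip(indices, values):
        dense[index] = value      # only the listed positions are filled in
    return dense


vec1 = to_dense(2, [0], [1.0])
vec2 = to_dense(2, [1], [1.0])
print(vec1, vec2)  # [1.0, 0.0] [0.0, 1.0]
```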

Look closely: you are combining two vectors that have the same dimensions, so the real result would be different. The first argument tells us that each vector has only 2 dimensions, so [1,0] + [0,1] => [1,1], and the correct representation would be Vectors.sparse(2, Array(0, 1), Array(1.0, 1.0)), not a four-dimensional vector.
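
As a rough illustration of that element-wise reading, the plain-Python sketch below adds two sparse triples of equal size. `add_sparse` is a hypothetical helper, not a Spark method, and the vectors are modelled as (size, indices, values) tuples.

```python
def add_sparse(v1, v2):
    """Element-wise sum of two sparse (size, indices, values) triples."""
    size1, indices1, values1 = v1
    size2, indices2, values2 = v2
    if size1 != size2:
        raise ValueError("element-wise addition needs vectors of equal size")
    totals = {}
    for index, value in list(zip(indices1, values1)) + list(zip(indices2, values2)):
        totals[index] = totals.get(index, 0.0) + value  # shared indices accumulate
    indices = sorted(totals)
    return size1, indices, [totals[i] for i in indices]


summed = add_sparse((2, [0], [1.0]), (2, [1], [1.0]))
print(summed)  # (2, [0, 1], [1.0, 1.0])
```

The size stays at 2, which is what distinguishes addition from concatenation.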

On the other hand, if the two vectors describe different dimensions and you want to combine them in a higher-dimensional space, say four dimensions, then your operation is valid. However, the SparseVector class does not provide this functionality, so you would have to write a function for it, something like this (a bit imperative, but I accept suggestions):

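import org.apache.spark.mllib.linalg.SparseVector // or org.apache.spark.ml.linalg.SparseVector with the DataFrame API

def combine(v1: SparseVector, v2: SparseVector): SparseVector = {
    val size = v1.size + v2.size
    // v2's indices are shifted by v1.size so they land after v1's slots
    val offset = v1.size
    val indices = v1.indices ++ v2.indices.map(e => e + offset)
    val values = v1.values ++ v2.values
    new SparseVector(size, indices, values)
}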
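
The same offset idea extends to any number of vectors. The following is a plain-Python sketch under the assumption that each vector is a (size, indices, values) tuple; `concat_sparse` is a hypothetical name and nothing here calls Spark.

```python
def concat_sparse(*vectors):
    """Concatenate sparse (size, indices, values) triples left to right."""
    offset = 0
    indices, values = [], []
    for size, vec_indices, vec_values in vectors:
        # shift each index by the total size of the vectors already placed
        indices.extend(i + offset for i in vec_indices)
        values.extend(vec_values)
        offset += size
    return offset, indices, values


vec3 = concat_sparse((2, [0], [1.0]), (2, [1], [1.0]))
print(vec3)  # (4, [0, 3], [1.0, 1.0])
```

For the two vectors in the question this gives non-zero entries at indices 0 and 3, that is, [1, 0, 0, 1].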
Alberto Bonsanto answered Oct 17 '22 10:10