Concatenate Sparse Vectors in Spark?

Tags:

apache-spark

Say you have two Sparse Vectors. As an example:

val vec1 = Vectors.sparse(2, Array(0), Array(1.0)) // [1, 0]
val vec2 = Vectors.sparse(2, Array(1), Array(1.0)) // [0, 1]

I want to concatenate these two vectors so that the result is equivalent to:

val vec3 = Vectors.sparse(4, Array(0, 2), Array(1.0, 1.0)) // [1, 0, 0, 1]

Does Spark have a convenience method for this?


asked Dec 04 '15 21:12

2 Answers

If you have the data in a DataFrame, then VectorAssembler would be the right thing to use. For example:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors

# Assumes an active SparkSession bound to `spark`, as in the pyspark shell.
dataset = spark.createDataFrame(
    [(0,
      Vectors.sparse(10, {0: 0.6931, 5: 0.0, 7: 0.5754, 9: 0.2877}),
      Vectors.sparse(10, {3: 0.2877, 4: 0.6931, 5: 0.0, 6: 0.6931, 8: 0.6931}))],
    ["label", "userFeatures1", "userFeatures2"])

assembler = VectorAssembler(
    inputCols=["userFeatures1", "userFeatures2"],
    outputCol="features")

output = assembler.transform(dataset)
output.select("features", "label").show(truncate=False)

You would get the following output for this:

+---------------------------------------------------------------------------+-----+
|features                                                                   |label|
+---------------------------------------------------------------------------+-----+
|(20,[0,7,9,13,14,16,18],[0.6931,0.5754,0.2877,0.2877,0.6931,0.6931,0.6931])|0    |
+---------------------------------------------------------------------------+-----+
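To see where those indices come from, here is a minimal plain-Python sketch of the arithmetic, with no Spark involved. It assumes, based on the output above, that the second vector's indices are shifted by the size of the first (10) and that explicitly stored zeros (index 5 in both inputs) are dropped. The `assemble` helper is hypothetical and is not part of the Spark API.

```python
def assemble(first, second, first_size):
    """Model the concatenation of two sparse vectors given as {index: value} dicts."""
    merged = dict(first)
    # Entries of the second vector land after all positions of the first one.
    for index, value in second.items():
        merged[index + first_size] = value
    # Keep entries in index order and drop explicitly stored zeros.
    return {i: v for i, v in sorted(merged.items()) if v != 0.0}


user_features1 = {0: 0.6931, 5: 0.0, 7: 0.5754, 9: 0.2877}
user_features2 = {3: 0.2877, 4: 0.6931, 5: 0.0, 6: 0.6931, 8: 0.6931}

features = assemble(user_features1, user_features2, 10)
print(list(features.keys()))    # [0, 7, 9, 13, 14, 16, 18]
print(list(features.values()))
```

The printed indices match the `features` column in the table: 3, 4, 6 and 8 from the second vector become 13, 14, 16 and 18.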


answered Oct 17 '22 11:10

I think you have a slight misunderstanding of SparseVectors, so here is a short explanation. The first argument is the number of features (columns, or dimensions) of the data. Each entry in the second argument is the position of a feature, and each entry in the third argument is the value at that position. SparseVectors are therefore position-sensitive, and from my point of view your approach is incorrect.

Look closely: you are summing, or combining, two vectors that have the same dimensions, so the real result is different. The first argument tells us that each vector has only 2 dimensions, so [1,0] + [0,1] => [1,1], and the correct representation would be Vectors.sparse(2, [0,1], [1,1]), not a four-dimensional vector.
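The two readings can be compared on the dense forms of the question's vectors. This is only an illustrative sketch using plain Python lists, not Spark types:

```python
vec1 = [1, 0]
vec2 = [0, 1]

# Element-wise sum: the result stays 2-dimensional.
elementwise_sum = [a + b for a, b in zip(vec1, vec2)]

# Concatenation: the result has 2 + 2 = 4 dimensions.
concatenated = vec1 + vec2

print(elementwise_sum)  # [1, 1]
print(concatenated)     # [1, 0, 0, 1]
```

The question asks for the second result, which is the case addressed next.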

On the other hand, if the two vectors represent different dimensions and you want to combine them in a higher-dimensional space (say, four dimensions), then your operation may be valid. The SparseVector class does not provide this functionality, however, so you would have to write a function for it yourself, something like this (a bit imperative, but I accept suggestions):

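import org.apache.spark.mllib.linalg.SparseVector  // ml.linalg.SparseVector in the DataFrame-based API

// Concatenate v2 after v1: v2's indices are shifted by v1.size.
def combine(v1: SparseVector, v2: SparseVector): SparseVector = {
    val size = v1.size + v2.size
    val offset = v1.size
    val indices = v1.indices ++ v2.indices.map(_ + offset)
    val values = v1.values ++ v2.values
    new SparseVector(size, indices, values)
}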

answered Oct 17 '22 10:10

Alberto Bonsanto

Asked by rocket_raccoon; answered by rakeb and Alberto Bonsanto.
