Say you have two Sparse Vectors. As an example:
val vec1 = Vectors.sparse(2, List(0), List(1)) // [1, 0]
val vec2 = Vectors.sparse(2, List(1), List(1)) // [0, 1]
I want to concatenate these two vectors so that the result is equivalent to:
val vec3 = Vectors.sparse(4, List(0, 2), List(1, 1)) // [1, 0, 0, 1]
Does Spark have any such convenience method to do this?
A sparse vector is used for storing non-zero entries for saving space. It has two parallel arrays: One for indices. The other for values.
A local vector has integer-typed and 0-based indices and double-typed values, stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values.
A sparse vector is a vector having a relatively small number of nonzero elements. Consider the following as an example of a sparse vector x with n elements, where n is 11, and vector x is: (0.0, 0.0, 1.0, 0.0, 2.0, 3.0, 0.0, 4.0, 0.0, 5.0, 0.0) In Storage.
If you have the data in a DataFrame
, then VectorAssembler
would be the right thing to use. For example:
from pyspark.ml.feature import VectorAssembler
dataset = spark.createDataFrame(
[(0, Vectors.sparse(10, {0: 0.6931, 5: 0.0, 7: 0.5754, 9: 0.2877}), Vectors.sparse(10, {3: 0.2877, 4: 0.6931, 5: 0.0, 6: 0.6931, 8: 0.6931}))],
["label", "userFeatures1", "userFeatures2"])
assembler = VectorAssembler(
inputCols=["userFeatures1", "userFeatures2"],
outputCol="features")
output = assembler.transform(dataset)
output.select("features", "label").show(truncate=False)
You would get the following output for this:
+---------------------------------------------------------------------------+-----+
|features |label|
+---------------------------------------------------------------------------+-----+
|(20,[0,7,9,13,14,16,18], [0.6931,0.5754,0.2877,0.2877,0.6931,0.6931,0.6931])|0|
+---------------------------------------------------------------------------+-----+
I think you have a slight problem understanding SparseVectors
. Therefore I will make a little explanation about them, the first argument is the number of features | columns | dimensions of the data, besides every entry of the List
in the second argument represent the position of the feature, and the values in the the third List
represent the value for that column, therefore SparseVectors
are locality sensitive, and from my point of view your approach is incorrect.
If you pay more attention you are summing or combining two vectors that have the same dimensions, hence the real result would be different, the first argument tells us that the vector has only 2 dimensions, so [1,0] + [0,1] => [1,1]
and the correct representation would be Vectors.sparse(2, [0,1], [1,1])
, not four dimensions.
In the other hand if each vector has two different dimensions and you are trying to combine them and represent them in a higher dimensional space, let's say four then your operation might be valid, however this functionality isn't provided by the SparseVector class, and you would have to program a function to do that, something like (a bit imperative but I accept suggestions):
def combine(v1:SparseVector, v2:SparseVector):SparseVector = {
val size = v1.size + v2.size
val maxIndex = v1.size
val indices = v1.indices ++ v2.indices.map(e => e + maxIndex)
val values = v1.values ++ v2.values
new SparseVector(size, indices, values)
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With