The Spark documentation for pyspark.ml.linalg.SparseMatrix says:
Column-major sparse matrix. The entry values are stored in Compressed
Sparse Column (CSC) format. For example, the following matrix
1.0 0.0 4.0
0.0 3.0 5.0
2.0 0.0 6.0
is stored as values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
rowIndices=[0, 2, 1, 0, 1, 2],
colPointers=[0, 2, 3, 6]
Can you explain how the colPointers are derived? The documentation says that they represent the index corresponding to the start of a new column, but I still cannot wrap my head around it.
Using characters in the matrix instead of floats makes reading it easier:
a 0 d
0 c e
b 0 f
is stored as
values: [a, b, c, d, e, f]
rowIndices: [0, 2, 1, 0, 1, 2]
colPointers: [0, 2, 3, 6]
values holds the non-zero values of the matrix, listed column by column.
rowIndices maps each entry of values to its row in the matrix: a is stored in row 0, b in row 2, c in row 1, and so on. There is exactly one row index per value in values.
colPointers splits values into columns. We can picture the values list as [|a, b,| c,| d, e, f|], with | as separators at indices 0, 2, 3, 6 in the list values:
a and b belong to the first column
c belongs to the second column
d, e and f belong to the third column
In other words, column j consists of the entries from values[colPointers[j]] up to, but not including, values[colPointers[j+1]], so colPointers has one more entry than the matrix has columns.
colPointers always starts with 0 and ends with the length of values (6 here).
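To make the derivation concrete, here is a minimal pure-Python sketch that builds the three arrays from a dense matrix and then reads one column back. The helper names to_csc and column_entries are made up for this illustration and are not part of the Spark API; the sketch also assumes the matrix is given as a list of rows of equal length.

```python
def to_csc(dense):
    """Build (values, row_indices, col_pointers) from a list of rows."""
    num_rows = len(dense)
    num_cols = len(dense[0]) if dense else 0
    values, row_indices, col_pointers = [], [], [0]
    for col in range(num_cols):          # walk the matrix column by column
        for row in range(num_rows):
            entry = dense[row][col]
            if entry != 0:               # keep only the non-zero entries
                values.append(entry)
                row_indices.append(row)
        # Once a column is finished, record how many values are stored so far.
        # That count is the index in `values` where the next column starts.
        col_pointers.append(len(values))
    return values, row_indices, col_pointers


def column_entries(values, row_indices, col_pointers, col):
    """Return the (row, value) pairs stored for one column."""
    start, end = col_pointers[col], col_pointers[col + 1]
    return list(zip(row_indices[start:end], values[start:end]))


matrix = [
    [1.0, 0.0, 4.0],
    [0.0, 3.0, 5.0],
    [2.0, 0.0, 6.0],
]
values, row_indices, col_pointers = to_csc(matrix)
print(values)        # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(row_indices)   # [0, 2, 1, 0, 1, 2]
print(col_pointers)  # [0, 2, 3, 6]
print(column_entries(values, row_indices, col_pointers, 2))
# [(0, 4.0), (1, 5.0), (2, 6.0)]
```

Each entry of col_pointers is simply a running count of the non-zero values seen so far, taken at every column boundary. That is why it starts at 0 and ends at the length of values.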