How to create a sparse CSCMatrix using Spark?

The documentation of Spark, for creating a pyspark.ml.linalg.SparseMatrix says:

Column-major sparse matrix. The entry values are stored in Compressed
Sparse Column (CSC) format. For example, the following matrix

   1.0 0.0 4.0 
   0.0 3.0 5.0
   2.0 0.0 6.0   

is stored as values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 
rowIndices=[0, 2, 1, 0, 1, 2], 
colPointers=[0, 2, 3, 6]

Can you explain how we derive the colPointers? The docs say they represent the index corresponding to the start of each new column, but I still cannot wrap my head around it.

asked Jan 04 '23 by Dimitris Poulopoulos

1 Answer

Using characters in the matrix instead of floats makes reading it easier:

a 0 d 
0 c e
b 0 f 

is stored as

values: [a, b, c, d, e, f]
rowIndices: [0, 2, 1, 0, 1, 2]
colPointers: [0, 2, 3, 6]
  • values holds the non-zero entries of the matrix, read column by column
  • rowIndices maps each entry in values to its row in the matrix: a sits in row 0, b in row 2, c in row 1, and so on. There is exactly one row index per value in values
  • colPointers splits values into columns. We can picture values as [|a, b,| c,| d, e, f|], with the splitters | placed at indices 0, 2, 3, 6 of values:
    • a and b belong to the first column
    • c belongs to the second column
    • d, e and f belong to the third column
    • Note that colPointers always starts with 0 and ends with the length of values (6 here); the difference between two consecutive pointers is the number of non-zero entries in that column
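To make the derivation concrete, here is a small sketch in plain Python (no Spark needed; the helper name dense_to_csc is my own) that walks the example matrix column by column and produces the three CSC arrays:

```python
def dense_to_csc(matrix):
    """Convert a dense matrix (list of rows) to CSC arrays.

    Returns (values, row_indices, col_pointers) in the layout
    described above: values column by column, one row index per
    value, and one pointer recorded after finishing each column.
    """
    num_rows = len(matrix)
    num_cols = len(matrix[0])
    values, row_indices, col_pointers = [], [], [0]
    for j in range(num_cols):          # column-major: outer loop over columns
        for i in range(num_rows):
            if matrix[i][j] != 0.0:
                values.append(matrix[i][j])
                row_indices.append(i)
        # after column j, record how many values we have collected so far
        col_pointers.append(len(values))
    return values, row_indices, col_pointers


m = [[1.0, 0.0, 4.0],
     [0.0, 3.0, 5.0],
     [2.0, 0.0, 6.0]]

values, row_indices, col_pointers = dense_to_csc(m)
print(values)        # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(row_indices)   # [0, 2, 1, 0, 1, 2]
print(col_pointers)  # [0, 2, 3, 6]
```

The same three arrays feed directly into Spark's constructor, e.g. `SparseMatrix(3, 3, col_pointers, row_indices, values)` from `pyspark.ml.linalg`.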
answered Jan 14 '23 by mquantin