How to create a sparse CSCMatrix using Spark?

The documentation of Spark, for creating a pyspark.ml.linalg.SparseMatrix says:

Column-major sparse matrix. The entry values are stored in Compressed
Sparse Column (CSC) format. For example, the following matrix

   1.0 0.0 4.0 
   0.0 3.0 5.0
   2.0 0.0 6.0   

is stored as values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 
rowIndices=[0, 2, 1, 0, 1, 2], 
colPointers=[0, 2, 3, 6]

Can you explain how we derive the colPointers? The docs say they represent the index corresponding to the start of each new column, but I still cannot wrap my head around it.

asked Jan 04 '23 by Dimitris Poulopoulos

1 Answer

Using characters in the matrix instead of floats makes reading it easier:

a 0 d 
0 c e
b 0 f 

is stored as

values: [a, b, c, d, e, f]
rowIndices: [0, 2, 1, 0, 1, 2]
colPointers: [0, 2, 3, 6]
  • values holds the non-zero entries of the matrix, read column by column
  • rowIndices maps each entry in values to its row in the matrix: a sits in row 0, b in row 2, c in row 1, and so on. There is exactly one row index per value in values
  • colPointers splits values into columns. We can picture values as [|a, b,| c,| d, e, f|], with the splitters | placed at indices 0, 2, 3, 6 of values:
    • a and b belong to the first column
    • c belongs to the second column
    • d, e and f belong to the third column
    • Note that colPointers always starts with 0 and ends with the length of values (6 here); the difference between two consecutive pointers is the number of non-zero entries in that column
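To make the derivation concrete, here is a small sketch in plain Python (no Spark needed; the helper name dense_to_csc is my own) that walks the example matrix column by column and produces the three CSC arrays:

```python
def dense_to_csc(matrix):
    """Convert a dense matrix (list of rows) to CSC arrays.

    Returns (values, row_indices, col_pointers) in the layout
    described above: values column by column, one row index per
    value, and one pointer recorded after finishing each column.
    """
    num_rows = len(matrix)
    num_cols = len(matrix[0])
    values, row_indices, col_pointers = [], [], [0]
    for j in range(num_cols):          # column-major: outer loop over columns
        for i in range(num_rows):
            if matrix[i][j] != 0.0:
                values.append(matrix[i][j])
                row_indices.append(i)
        # after column j, record how many values we have collected so far
        col_pointers.append(len(values))
    return values, row_indices, col_pointers


m = [[1.0, 0.0, 4.0],
     [0.0, 3.0, 5.0],
     [2.0, 0.0, 6.0]]

values, row_indices, col_pointers = dense_to_csc(m)
print(values)        # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(row_indices)   # [0, 2, 1, 0, 1, 2]
print(col_pointers)  # [0, 2, 3, 6]
```

The same three arrays feed directly into Spark's constructor, e.g. `SparseMatrix(3, 3, col_pointers, row_indices, values)` from `pyspark.ml.linalg`.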
answered Jan 14 '23 by mquantin