I'd like to write a function that normalizes the rows of a large sparse matrix (such that they sum to one).
from pylab import *
import scipy.sparse as sp

def normalize(W):
    z = W.sum(0)
    z[z < 1e-6] = 1e-6
    return W / z[None,:]

w = (rand(10,10)<0.1)*rand(10,10)
w = sp.csr_matrix(w)
w = normalize(w)
However this gives the following exception:
  File "/usr/lib/python2.6/dist-packages/scipy/sparse/base.py", line 325, in __div__
    return self.__truediv__(other)
  File "/usr/lib/python2.6/dist-packages/scipy/sparse/compressed.py", line 230, in __truediv__
    raise NotImplementedError
Are there any reasonably simple solutions? I have looked at this, but am still unclear on how to actually do the division.
This is implemented in scikit-learn as sklearn.preprocessing.normalize.
from sklearn.preprocessing import normalize
w_normalized = normalize(w, norm='l1', axis=1)
axis=1 should normalize by rows, axis=0 by columns. Use the optional argument copy=False to modify the matrix in place.
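As a quick check of the call above, here is a minimal sketch on a made-up 3x3 matrix (the values are only for illustration). Note that the L1 norm is the sum of absolute values, so rows sum to exactly one only when all entries are non-negative, as they are in the question.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import normalize

# Illustrative matrix; the second row is deliberately all zeros.
dense = np.array([[1.0, 0.0, 3.0],
                  [0.0, 0.0, 0.0],
                  [2.0, 2.0, 0.0]])
w = sp.csr_matrix(dense)

# L1 normalization along axis=1 divides each row by the sum of its absolute values.
w_normalized = normalize(w, norm='l1', axis=1)

# Row sums of the result, flattened to a 1-D array for easy inspection.
row_sums = np.asarray(w_normalized.sum(axis=1)).ravel()
print(row_sums)
```

The result stays sparse, the non-empty rows sum to one, and the all-zero row is left as zeros rather than producing a division by zero.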
While Aaron's answer is correct, I implemented my own solution because I wanted to normalize with respect to the maximum of the absolute values, which sklearn does not offer. My method takes the row indices of the nonzero entries and uses them to locate the matching values in the csr_matrix.data array, so that they can be replaced there quickly.
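import numpy as np

def normalize_sparse(csr_matrix):
    # Drop explicitly stored zeros so that nonzero() lines up with csr_matrix.data.
    csr_matrix.eliminate_zeros()
    # Row index of every stored entry, in the same order as csr_matrix.data.
    nonzero_rows = csr_matrix.nonzero()[0]
    for idx in np.unique(nonzero_rows):
        # Positions in csr_matrix.data that belong to row idx.
        data_idx = np.where(nonzero_rows == idx)[0]
        abs_max = np.max(np.abs(csr_matrix.data[data_idx]))
        if abs_max != 0:
            csr_matrix.data[data_idx] = 1./abs_max * csr_matrix.data[data_idx]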
In contrast to sunan's solution, this method requires neither casting the matrix into dense format (which could raise memory problems) nor any matrix multiplications. I tested the method on a sparse matrix of shape (35'000, 486'000) and it took ~ 18 seconds.
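The per-row Python loop can probably be avoided. Below is a rough vectorized sketch of the same idea (scale each row by its largest absolute value, working directly on the data array). The function name normalize_rows_by_abs_max is hypothetical, and the sketch assumes a CSR matrix with a floating-point dtype; I have not timed it against the loop version.

```python
import numpy as np
import scipy.sparse as sp

def normalize_rows_by_abs_max(m):
    """Scale each row of a float CSR matrix in place by its largest absolute value."""
    # Number of stored entries per row, taken from the CSR row pointer.
    counts = np.diff(m.indptr)
    # Row index of every stored entry, in the same order as m.data.
    row_of_entry = np.repeat(np.arange(m.shape[0]), counts)
    # Per-row maximum of |value|; rows without stored entries stay at 0.
    abs_max = np.zeros(m.shape[0])
    np.maximum.at(abs_max, row_of_entry, np.abs(m.data))
    # Avoid dividing by zero for empty rows or rows that store only zeros.
    abs_max[abs_max == 0] = 1.0
    m.data /= abs_max[row_of_entry]
    return m

# Illustrative input with one empty row.
m = sp.csr_matrix(np.array([[1.0, -4.0, 0.0],
                            [0.0, 0.0, 0.0],
                            [0.5, 0.0, 0.25]]))
normalize_rows_by_abs_max(m)
print(m.toarray())
```

Because the row index of each stored entry comes from indptr rather than from nonzero(), explicitly stored zeros do not shift the indexing.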
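Here is my solution: transpose A, compute the sum of each column, build a diagonal matrix B from the reciprocals of those sums, compute C = A*B, and then transpose C.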
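import scipy.sparse as sp
import numpy as np
import math

minf = 0.0001  # sums smaller than this are treated as zero

A = sp.lil_matrix((5,5))
b = np.arange(0,5)
A.setdiag(b[:-1], k=1)
A.setdiag(b)
print(A.todense())

# Transpose so that the rows of the original matrix become columns.
A = A.T
print(A.todense())

sum_of_col = A.sum(0).tolist()
print(sum_of_col)

# Reciprocal of each column sum (0 where the sum is effectively zero).
c = []
for i in sum_of_col:
    for j in i:
        if math.fabs(j)<minf:
            c.append(0)
        else:
            c.append(1/j)
print(c)

# Diagonal matrix of reciprocals; A*B scales each column of A.
B = sp.lil_matrix((5,5))
B.setdiag(c)
print(B.todense())

C = A*B
print(C.todense())

# Transpose back so that the rows of C are the normalized rows.
C = C.T
print(C.todense())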
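The same diagonal-scaling idea can be written more compactly by multiplying from the left, which avoids both transposes. This is only a sketch: the function name normalize_rows_to_sum_one is made up for illustration, it assumes a SciPy version that provides sp.diags and a Python version with the @ operator, and it keeps the convention from the code above of zeroing rows whose sum is below minf.

```python
import numpy as np
import scipy.sparse as sp

def normalize_rows_to_sum_one(A, minf=1e-4):
    """Return a CSR copy of A in which every row with a non-negligible sum sums to one."""
    A = sp.csr_matrix(A, dtype=float)
    # Row sums as a flat 1-D array (sum() on a sparse matrix returns a 2-D result).
    row_sums = np.asarray(A.sum(axis=1)).ravel()
    # Reciprocal of each row sum; rows with |sum| < minf get 0, as in the loop above.
    inv = np.zeros_like(row_sums)
    mask = np.abs(row_sums) >= minf
    inv[mask] = 1.0 / row_sums[mask]
    # Left-multiplying by a diagonal matrix scales the rows of A.
    return (sp.diags(inv) @ A).tocsr()

# Illustrative input; the first row is all zeros.
example = sp.lil_matrix(np.array([[0.0, 0.0, 0.0],
                                  [1.0, 1.0, 0.0],
                                  [0.0, 2.0, 6.0]]))
C = normalize_rows_to_sum_one(example)
print(C.toarray())
```

Nothing here is converted to dense form except for the final print, so the approach should also be usable on large matrices.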