Get norm of numpy sparse matrix rows

Tags: python, arrays, matrix, numpy, norm

I have a sparse matrix that I obtained by using Sklearn's TfidfVectorizer object:

vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', vocabulary=my_vocab, stop_words='english')
tfidf = vect.fit_transform([my_docs])

The sparse matrix is (taking out the numbers for generality):

<sparse matrix of type '<type 'numpy.float64'>'
with stored elements in Compressed Sparse Row format>]

I am trying to get a single numeric value for each row that tells me how strongly a document matched the terms I am looking for. I don't really care which words it contained, I just want to know how many it contained. So I want to get the norm of each row, or row*row.T. However, I am having a very hard time working with numpy to obtain this.

My first approach was to just simply do:

tfidf[i] * numpy.transpose(tfidf[i])

However, numpy will apparently not transpose an array with fewer than two dimensions, so that will just square the vector. So I tried doing:

tfidf[i] * numpy.transpose(numpy.atleast_2d(tfidf[0]))

But numpy.transpose(numpy.atleast_2d(tfidf[0])) still would not transpose the row.

I moved on to trying to get the norm of the row (that approach is probably better anyway). My initial approach was to use numpy.linalg:

numpy.linalg.norm(tfidf[0])

But that gave me a "dimension mismatch" error. So I tried to calculate the norm manually. I started by just setting a variable equal to a numpy array version of the sparse matrix and printing out the len of the first row:

my_array = numpy.array(tfidf)
print my_array
print len(my_array[0])

It prints out my_array correctly, but when I try to access the len it tells me:

IndexError: 0-d arrays can't be indexed

I just simply want to get a numeric value of each row in the sparse matrix returned by fit_transform. Getting the norm would be best. Any help here is very appreciated.

asked Nov 23 '13 22:11

Sterling

2 Answers

Some simple fake data:

import numpy as np
from scipy import sparse
a = np.arange(9.).reshape(3,3)
s = sparse.csr_matrix(a)

To get the norm of each row from the sparse matrix, you can use:

np.sqrt(s.multiply(s).sum(1))
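As a rough sketch of how that expression behaves, the snippet below flattens the result into a plain 1-D array and compares it with two alternatives. It assumes a scipy version that ships scipy.sparse.linalg.norm, and the names row_norms, row_norms_alt and sq_norm_i are only illustrative.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import norm as sparse_norm

a = np.arange(9.).reshape(3, 3)
s = sparse.csr_matrix(a)

# Square element-wise, sum along each row; .sum(1) gives an (n, 1) matrix,
# so convert and flatten it to get one norm per row as a 1-D array.
row_norms = np.asarray(np.sqrt(s.multiply(s).sum(1))).ravel()

# The same quantity through scipy's sparse norm (assumed to be available).
row_norms_alt = sparse_norm(s, axis=1)

# row * row.T for a single row: a 1x1 sparse result holding the squared norm.
i = 1
sq_norm_i = s[i].dot(s[i].T)[0, 0]
```

The last line is the row*row.T product from the question: slicing a sparse matrix keeps it two-dimensional, so .T does transpose it.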

And the renormalized s would be

s.multiply(1/np.sqrt(s.multiply(s).sum(1)))

or to keep it sparse before renormalizing:

s.multiply(sparse.csr_matrix(1/np.sqrt(s.multiply(s).sum(1))))
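Another way to keep everything sparse, sketched here under the assumption that some rows may be completely empty, is to scale the rows with a diagonal matrix. The variable names and the zero-row handling are my own choices, not part of the answer above.

```python
import numpy as np
from scipy import sparse

s = sparse.csr_matrix(np.array([[3., 4., 0.],
                                [0., 0., 0.],    # empty row, norm 0
                                [0., 5., 12.]]))

norms = np.asarray(np.sqrt(s.multiply(s).sum(1))).ravel()

# Reciprocal norms, leaving empty rows at 0 instead of dividing by zero.
inv = np.zeros_like(norms)
nonzero = norms > 0
inv[nonzero] = 1.0 / norms[nonzero]

# Left-multiplying by a diagonal matrix scales each row; the result stays sparse.
s_normalized = sparse.diags(inv, 0).dot(s)
```

Rows with zero norm are simply left as they are, which avoids the inf/nan values that 1/norm would otherwise produce.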

To get an ordinary matrix or array from it, use:

m = s.todense()
a = s.toarray()
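This is probably also where the IndexError in the question comes from. A small sketch, assuming a csr_matrix (the names wrapped and dense are illustrative): numpy.array does not know how to unpack a sparse matrix, so it wraps the whole object in a 0-d array instead of converting it.

```python
import numpy as np
from scipy import sparse

s = sparse.csr_matrix(np.arange(9.).reshape(3, 3))

# np.array wraps the sparse matrix object itself: a 0-d array of dtype object.
wrapped = np.array(s)

# toarray() performs the actual conversion to a 2-D float array.
dense = s.toarray()
```

Indexing wrapped with [0] fails because it has no dimensions, while dense[0] is the first row as expected.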

If you have enough memory for the dense version, you can get the norm of each row with:

n = np.sqrt(np.einsum('ij,ij->i',a,a))

or

n = np.apply_along_axis(np.linalg.norm, 1, a)

To normalize, you can do

an = a / n[:, None]

or, to normalize the original array in place:

a /= n[:, None]

The [:, None] index adds a second axis to n, turning it from shape (3,) into a column of shape (3, 1), so the division broadcasts across each row of a.
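To make the shapes concrete, here is a minimal sketch. It uses np.linalg.norm with the axis argument, which assumes a reasonably recent numpy, and the name col is just for illustration.

```python
import numpy as np

a = np.arange(9.).reshape(3, 3)

n = np.linalg.norm(a, axis=1)   # shape (3,): one norm per row
col = n[:, None]                # shape (3, 1); same as n[:, np.newaxis]

# Broadcasting divides every element of row i by n[i].
an = a / col
```

Without the extra axis, a / n would broadcast n along the last axis and divide each column by the corresponding norm, which is not what you want here.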

answered Sep 18 '22 16:09

askewchan

scipy.sparse is a great package, and it keeps getting better with every release, but a lot of things are still only half cooked, and you can get big performance improvements if you implement some of the algorithms yourself. For instance, here is a speedup of more than 7x over @askewchan's implementation using scipy functions (sps is scipy.sparse):

In [18]: a = sps.rand(1000, 1000, format='csr')

In [19]: a
Out[19]: 
<1000x1000 sparse matrix of type '<type 'numpy.float64'>'
    with 10000 stored elements in Compressed Sparse Row format>

In [20]: %timeit a.multiply(a).sum(1)
1000 loops, best of 3: 288 us per loop

In [21]: %timeit np.add.reduceat(a.data * a.data, a.indptr[:-1])
10000 loops, best of 3: 36.8 us per loop

In [24]: np.allclose(a.multiply(a).sum(1).ravel(),
    ...:             np.add.reduceat(a.data * a.data, a.indptr[:-1]))
Out[24]: True

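You can similarly normalize the array in place by doing the following:

# Row norms: sum the squared stored values within each row's slice of a.data
norm_rows = np.sqrt(np.add.reduceat(a.data * a.data, a.indptr[:-1]))
# Number of stored elements in each row
nnz_per_row = np.diff(a.indptr)
# Divide every stored value by the norm of the row it belongs to
a.data /= np.repeat(norm_rows, nnz_per_row)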

If you are going to be using sparse matrices often, read the Wikipedia page on compressed sparse formats, and you will often find better ways than the default to do things.
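One caveat with the reduceat trick, sketched below on a made-up matrix: for a row with no stored elements, np.add.reduceat does not return 0 but the single element sitting at that offset. As far as I can tell, empty rows at the very end can also push the index out of bounds. The bincount variant is an alternative I am suggesting for that case, not something from the timings above.

```python
import numpy as np
import scipy.sparse as sps

# Small CSR matrix whose middle row has no stored elements.
m = sps.csr_matrix(np.array([[1., 2., 0.],
                             [0., 0., 0.],
                             [0., 3., 4.]]))
sq = m.data * m.data

# For the empty row, reduceat picks up sq[m.indptr[1]] instead of 0.
via_reduceat = np.add.reduceat(sq, m.indptr[:-1])

# Alternative: label each stored value with its row index, then sum per label.
nnz_per_row = np.diff(m.indptr)
row_ids = np.repeat(np.arange(m.shape[0]), nnz_per_row)
sq_sums = np.bincount(row_ids, weights=sq, minlength=m.shape[0])
norm_rows = np.sqrt(sq_sums)

# In-place normalization; empty rows contribute nothing to np.repeat,
# so their zero norm never ends up in the denominator.
m.data /= np.repeat(norm_rows, nnz_per_row)
```

If your matrix is guaranteed to have at least one stored element in every row, the plain reduceat version is fine and a bit shorter.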

answered Sep 16 '22 16:09

Jaime
