I'm trying to write a function in Python (still a noob!) which returns indices and scores of documents ordered by the inner products of their tfidf scores. The procedure is:
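1. Compute the vector of inner products between document idx and all other documents.
2. Sort in descending order.
3. Return the scores and indices from the second entry onward (i.e. excluding the document itself).

The code I have at the moment is: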
import h5py
import numpy as np

def get_related(tfidf, idx):
    ''' return the top documents '''
    # calculate inner product
    v = np.inner(tfidf, tfidf[idx].transpose())
    # sort
    vs = np.sort(v.toarray(), axis=0)[::-1]
    scores = vs[1:,]
    # sort indices
    vi = np.argsort(v.toarray(), axis=0)[::-1]
    idxs = vi[1:,]
    return (scores, idxs)
where tfidf is a sparse matrix of type '<type 'numpy.float64'>'.
This seems inefficient, as the sort is performed twice (sort() then argsort()), and the results then have to be reversed.
Can this be done without converting the sparse matrix with toarray()?

You can use numpy.flipud() or numpy.fliplr() to get the indices in descending order after sorting with argsort().
numpy.argsort() performs an indirect sort along the given axis using the algorithm specified by the kind keyword. It returns an array of indices, of the same shape as the input, that would sort the array. In other words, it gives the indices of the values arranged in ascending order.
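To sort in descending order: the list.sort() method of a plain Python list accepts an optional reverse parameter, and setting reverse=True sorts the list in descending order. np.sort() has no such parameter, so a NumPy result has to be reversed after sorting.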
np.sort() returns the sorted array, whereas np.argsort() returns an array of the corresponding indices. For example, sorting the unsorted array [10, 6, 8, 2, 5, 4, 9, 1] gives [1, 2, 4, 5, 6, 8, 9, 10].
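A minimal sketch of the difference between the two calls, using the example array above. Reversing the argsort() result with [::-1] is one way to get descending order; for a 1-D array np.flipud() gives the same result.

```python
import numpy as np

a = np.array([10, 6, 8, 2, 5, 4, 9, 1])

sorted_vals = np.sort(a)      # values in ascending order
order = np.argsort(a)         # indices that would sort `a` ascending
desc_order = order[::-1]      # same indices reversed, i.e. descending order
desc_vals = a[desc_order]     # values in descending order

print(sorted_vals.tolist())   # [1, 2, 4, 5, 6, 8, 9, 10]
print(order.tolist())         # [7, 3, 5, 4, 1, 2, 6, 0]
print(desc_vals.tolist())     # [10, 9, 8, 6, 5, 4, 2, 1]
```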
I don't think there's any real need to skip the toarray. The v array will be only n_docs long, which is dwarfed by the size of the n_docs × n_terms tf-idf matrix in practical situations. Also, it will be quite dense since any term shared by two documents will give them a non-zero similarity. Sparse matrix representations only pay off when the matrix you're storing is very sparse (I've seen >80% figures for Matlab and assume that Scipy will be similar, though I don't have an exact figure).
The double sort can be skipped by doing
v = v.toarray()
vi = np.argsort(v, axis=0)[::-1]
vs = v[vi]
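As a rough sketch of the single-sort idea on a small dense array: the values in v below are made up for illustration, and the ravel() call is an addition here (not part of the snippet above) so that the indices and scores come out one-dimensional.

```python
import numpy as np

# Made-up similarity column vector, shape (n_docs, 1), like toarray() would return
v = np.array([[0.2], [1.0], [0.0], [0.7]])

v = v.ravel()                 # flatten to shape (n_docs,)
vi = np.argsort(v)[::-1]      # one sort: indices in descending score order
vs = v[vi]                    # reuse those indices to pull out the sorted scores

print(vi.tolist())  # [1, 3, 0, 2]
print(vs.tolist())  # [1.0, 0.7, 0.2, 0.0]
```

Only argsort() runs; the sorted scores are obtained by indexing, so nothing is sorted twice.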
Btw., your use of np.inner on sparse matrices is not going to work with the latest versions of NumPy; the safe way of taking an inner product of two sparse matrices is
v = tfidf * tfidf[idx, :].transpose()  # (n_docs, n_terms) x (n_terms, 1) -> (n_docs, 1)
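Putting the pieces together, here is one possible sketch of the whole function. It assumes tfidf is a scipy.sparse CSR matrix with one row per document; the tiny matrix is made up for illustration. It uses .dot() for the sparse product and filters out the query document by index instead of dropping the first entry, both of which are choices made for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix


def get_related(tfidf, idx):
    """Return (scores, indices) of the other documents, most similar first."""
    # Sparse matrix product: (n_docs, n_terms) x (n_terms, 1) -> (n_docs, 1)
    v = tfidf.dot(tfidf[idx, :].transpose())
    v = v.toarray().ravel()      # small dense vector of length n_docs
    vi = np.argsort(v)[::-1]     # single sort, descending
    vi = vi[vi != idx]           # drop the query document itself
    return v[vi], vi


# Tiny made-up tf-idf matrix: 4 documents, 3 terms
tfidf = csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [2.0, 0.0, 0.0],
]))

scores, idxs = get_related(tfidf, 0)
print(idxs.tolist())    # [3, 1, 2]
print(scores.tolist())  # [2.0, 1.0, 0.0]
```

Filtering with vi != idx does not rely on a document being its own best match. With unnormalized rows, as in this toy matrix, document 3 scores higher against document 0 than document 0 does against itself, so slicing off the first entry would remove the wrong document.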