I'm new to Python, and I'm trying to calculate the PageRank vector in Python according to this equation:

Pi(k)^T = Pi(k-1)^T * G = alpha * Pi(k-1)^T * H + (alpha * Pi(k-1)^T * a + (1 - alpha)) * e^T / n

where Pi(k) is the PageRank vector after the k-th iteration, G is the Google matrix, H is the hyperlink matrix, a is the dangling-node vector, alpha = 0.85, e is a vector of ones, and n is the number of URLs.
The calculation with G takes a lot of time, while using the hyperlink matrix H, which is a sparse matrix, should take significantly less time.
Here's my code:
for i in range(1, k_steps+1):
    for j in range(0, len(dictionary_urls)):
        for k in range(0, len(dictionary_urls)):
            if matrix_H[k][j] != 0:
                matrix_pi_k[i][j] += matrix_pi_k[i-1][k] * float(matrix_H[k][j])
            alpha_pi_k_a += matrix_pi_k[i-1][k] * float(vector_a[k])
        alpha_pi_k_a = alpha_pi_k_a * float(alpha)
        alpha_pi_k_a = alpha_pi_k_a + float(1 - alpha)
        alpha_pi_k_a = alpha_pi_k_a / float(len(dictionary_urls))
        matrix_pi_k[i][j] = matrix_pi_k[i][j] * float(alpha)
        matrix_pi_k[i][j] = matrix_pi_k[i][j] + float(alpha_pi_k_a)
        alpha_pi_k_a = 0
k_steps is the number of iterations needed.
dictionary_urls contains all the URLs.
After the code executes, matrix_pi_k should hold all the Pi vectors.
I calculated all the variables that are needed. The run time using the H matrix is almost equal to the run time using the G matrix, although in theory it should be significantly lower.
Why? And what should I change to reduce the run time?
Thank you.
PageRank (PR) is an algorithm used by Google Search to rank web pages in their search engine results. It is named after both the term "web page" and co-founder Larry Page. PageRank is a way of measuring the importance of website pages: it works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
In their original paper presenting Google, Larry and Sergey define PageRank like this: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1...Tn are the pages that link to page A, C(T) is the number of outbound links on page T, and d is a damping factor.
Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
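For concreteness, here is a minimal sketch of iterating that formula directly. The link graph, iteration count, and variable names are made up for the illustration; it is not from the original paper or the posted code.

# Illustrative only: a tiny, made-up link graph and a direct implementation
# of PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)).
links = {            # page -> pages it links out to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
d = 0.85
pr = {page: 1.0 for page in links}   # initial PageRank values

for _ in range(50):                  # fixed number of iterations
    new_pr = {}
    for page in links:
        # Sum PR(T)/C(T) over every page T that links to `page`;
        # C(T) is the number of outbound links of T.
        incoming = sum(pr[t] / len(links[t]) for t in links if page in links[t])
        new_pr[page] = (1 - d) + d * incoming
    pr = new_pr

print(pr)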
The problem is that you're multiplying a sparse matrix by a dense vector using the same-old dense matrix-vector multiplication algorithm. You won't see any speedups that way.
Suppose you have an n x n matrix A (dense or sparse) and an n-vector x. To compute y = Ax, we can write:
y = [0]*n
for i in range(n):
    for j in range(n):
        y[i] += A[i,j]*x[j]
This works whether the matrix A is dense or sparse. Suppose, though, that A is sparse. We still loop over all columns of A to compute a single entry of y, even though most of the entries will be zero. So the outer loop goes through n iterations, and the inner loop also goes through n iterations.
If we know which entries of A are nonzero, we can do much better. Suppose we have a list of all of the nonzero entries of row i, call it nonzero[i]. Then we can replace the inner loop with iteration over this list:
y = [0]*n
for i in range(n):
    for j in nonzero[i]:
        y[i] += A[i,j]*x[j]
So while our outer loop does n iterations, the inner loop only does as many iterations as there are nonzero entries. This is where the speedup of sparse matrix-vector multiplication comes from.
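To make that concrete, here is a small self-contained sketch (my illustration with made-up data, not part of the original answer) that builds the nonzero[i] lists from a mostly-zero array and checks that the sparse-style loop matches the dense product:

import numpy as np

n = 5
rng = np.random.default_rng(0)
M = rng.random((n, n))
A = np.where(M < 0.3, M, 0.0)        # a mostly-zero matrix
x = rng.random(n)

# Precompute, for each row i, the column indices of its nonzero entries.
nonzero = [list(np.nonzero(A[i])[0]) for i in range(n)]

# Sparse-style multiplication: the inner loop only touches nonzero entries.
y = [0.0] * n
for i in range(n):
    for j in nonzero[i]:
        y[i] += A[i, j] * x[j]

print(np.allclose(y, A.dot(x)))      # True: same result as the dense product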
But you have another problem: you're trying to do matrix multiplication with pure Python, which (due to type-checking, non-contiguous data structures, etc.) is slow. The solution is to use numpy, which provides fast algorithms and data structures. Then you can use scipy's sparse matrices, as they implement fast sparse matrix-vector multiplication for you.
We can show all of this with a quick experiment. First we'll generate a 10,000 x 10,000 dense matrix A:
>>> import numpy as np
>>> n = 10000
>>> A = np.random.sample((n,n))
Then we'll make a sparse matrix B by thresholding A. B is the same size as A, but only 10% of its entries are nonzero:
>>> B = np.where(A < .1, A, 0).astype(float)
Now we'll make a dense vector to multiply A and B with:
>>> x = np.random.sample(n)
>>> %timeit A.dot(x)
10 loops, best of 3: 46.7 ms per loop
>>> %timeit B.dot(x)
10 loops, best of 3: 43.7 ms per loop
It takes the same amount of time to compute Ax as it does to compute Bx, even though B is "sparse". Of course, it isn't really sparse: it's stored as a dense matrix with a lot of zero entries. Let's make it sparse:
>>> import scipy.sparse
>>> sparse_B = scipy.sparse.csr_matrix(B)
>>> %timeit sparse_B.dot(x)
100 loops, best of 3: 12 ms per loop
There's our speedup! Now, just for fun, what if we store A as a sparse matrix, even though it's really dense?
>>> sparse_A = scipy.sparse.csr_matrix(A)
>>> %timeit sparse_A.dot(x)
10 loops, best of 3: 112 ms per loop
Ouch! But this is to be expected, as storing A as a sparse matrix incurs some overhead during the multiplication.
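Putting this back in the context of the question: the whole PageRank update can be expressed with numpy/scipy instead of nested Python loops. Below is a rough sketch under the assumption that the hyperlink matrix, the dangling-node vector, and the damping factor are available as dense arrays H and a and a scalar alpha (names chosen for the example); it illustrates the idea rather than being a drop-in replacement for the posted code:

import numpy as np
import scipy.sparse

def pagerank(H, a, alpha=0.85, k_steps=100):
    """Power iteration: pi^(k)T = alpha*pi^(k-1)T*H + (alpha*pi^(k-1)T*a + 1 - alpha)/n * e^T.

    H     : n x n hyperlink matrix (dense numpy array, mostly zeros)
    a     : length-n dangling-node indicator vector
    alpha : damping factor
    """
    n = H.shape[0]
    HT = scipy.sparse.csr_matrix(H).T.tocsr()   # store H once in sparse form
    pi = np.full(n, 1.0 / n)                    # start from the uniform vector

    for _ in range(k_steps):
        dangling = alpha * pi.dot(a)            # alpha * (pi . a), a scalar
        pi = alpha * HT.dot(pi) + (dangling + 1 - alpha) / n
    return pi

Each iteration is then one sparse matrix-vector product plus a couple of cheap vector operations, instead of two nested pure-Python loops.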
Based on your formula, the calculation with matrix H doesn't look any faster than the calculation with matrix G.
Explanation:
You might want to take a look at an introduction to Big O notation.
The right-most part of the formula (after the +) consists only of a simple calculation without loops, and its Big O notation is just O(1), which means it does not depend on the number of URLs you are taking into account. Whereas the calculations for both H and G seem to be at least O(n^2) (n being the number of URLs).
Edit:
In the deeply nested part of your code, you have two instructions, one of them conditioned on whether matrix_H[k][j] is 0 or not. Still, even when it is 0, which will be the case most of the time if H is a sparse matrix, the second instruction is executed anyway. Plus, you enter the loop in any case.
That still gives you a complexity of O(n^2), so iterating over H is not (much) faster than iterating over G.
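As an illustration of that point, the loops can be restructured so the inner loop only visits the nonzero entries of each column of H, and the dangling-node term is computed once per iteration. The sketch below reuses the names from the question and assumes the same variables are already set up; it is a sketch of the idea, not a drop-in replacement.

# Precompute, once, the rows k where H[k][j] is nonzero for each column j.
n = len(dictionary_urls)
nonzero_in_col = [[k for k in range(n) if matrix_H[k][j] != 0] for j in range(n)]

for i in range(1, k_steps + 1):
    # The dangling-node correction depends only on the previous vector,
    # so it can be computed once per iteration instead of once per (j, k).
    pi_dot_a = sum(matrix_pi_k[i-1][k] * vector_a[k] for k in range(n))
    correction = (alpha * pi_dot_a + (1 - alpha)) / n

    for j in range(n):
        s = 0.0
        for k in nonzero_in_col[j]:          # only nonzero entries of column j
            s += matrix_pi_k[i-1][k] * matrix_H[k][j]
        matrix_pi_k[i][j] = alpha * s + correction

With this structure, each iteration costs O(n + nnz), where nnz is the number of nonzero entries of H, instead of O(n^2).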