Page Rank in Python

Tags: python, algorithm, pagerank

I'm new to Python, and I'm trying to calculate the PageRank vector according to this equation:

Pi(k+1) = Pi(k) * G = alpha * Pi(k) * H + (alpha * Pi(k) * a + 1 - alpha) * e / n

where Pi(k) is the PageRank (row) vector after the k-th iteration, G is the Google matrix, H is the hyperlink matrix, a is the dangling-node vector, alpha = 0.85, e is a (row) vector of ones, and n is the number of URLs.

The calculation with G takes a lot of time, while using the hyperlink matrix H, which is a sparse matrix, should take significantly less time.

Here's my code:

for i in range(1, k_steps+1):
  for j in range(0, len(dictionary_urls)):
    for k in range(0, len(dictionary_urls)):
        if matrix_H[k][j] != 0:
            matrix_pi_k[i][j] += matrix_pi_k[i-1][k] * float(matrix_H[k][j])
        alpha_pi_k_a += matrix_pi_k[i-1][k]*float(vector_a[k])

    alpha_pi_k_a = alpha_pi_k_a * float(alpha)
    alpha_pi_k_a = alpha_pi_k_a + float((1- alpha))
    alpha_pi_k_a = alpha_pi_k_a / float(len(dictionary_urls))
    matrix_pi_k[i][j] = matrix_pi_k[i][j] * float(alpha)

    matrix_pi_k[i][j] = matrix_pi_k[i][j] + float(alpha_pi_k_a)
    alpha_pi_k_a = 0

k_steps is the number of iterations needed.

dictionary_urls contains all the URLs.

After the code runs, matrix_pi_k should contain all the Pi vectors.

I have computed all the variables that are needed. The run time using the H matrix is almost equal to the run time using the G matrix, although in theory it should be much lower.

Why? And what should I change to reduce the run time?

Thank you.

asked Dec 22 '14 17:12

Roman Yanovitski

2 Answers

The problem is that you're multiplying a sparse matrix by a dense vector using the same-old dense matrix-vector multiplication algorithm. You won't see any speedups that way.

Suppose you have an nxn matrix A (dense or sparse) and an n-vector x. To compute y = Ax, we can write:

y = [0]*n
for i in range(n):
    for j in range(n):
        y[i] += A[i,j]*x[j]

This works whether the matrix A is dense or sparse. Suppose, though, that A is sparse. We still loop over all columns of A to compute a single entry of y, even though most of the entries will be zero. So the outer loop goes through n iterations, and the inner loop also goes through n iterations, for n^2 multiplications in total.

If we know which entries of A are nonzero, we can do much better. Suppose we have a list of all of the nonzero entries of row i, call it nonzero[i]. Then we can replace the inner loop with iteration over this list:

y = [0]*n
for i in range(n):
    for j in nonzero[i]:
        y[i] += A[i,j]*x[j]

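So while our outer loop still does n iterations, the inner loop does only as many iterations as there are nonzero entries in row i. The total work is proportional to the number of nonzero entries in A (plus n for the outer loop) rather than to n^2.

This is where the speedup of sparse matrix-vector multiplication comes from.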
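To make the two loops above concrete, here is a minimal runnable sketch in pure Python. The function names dense_matvec and sparse_matvec, the list-of-lists layout of A, and the small 4 x 4 matrix are illustrative assumptions, not something from the question.

```python
def dense_matvec(A, x):
    """Multiply a list-of-lists matrix A by a vector x, visiting every entry."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        for j in range(n):
            y[i] += A[i][j] * x[j]
    return y


def sparse_matvec(A, nonzero, x):
    """Same product, but visit only the column indices listed in nonzero[i]."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        for j in nonzero[i]:
            y[i] += A[i][j] * x[j]
    return y


# Small hypothetical example: a 4 x 4 matrix with 5 nonzero entries.
A = [
    [0.0, 2.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 5.0],
]
x = [1.0, 2.0, 3.0, 4.0]

# Build the index lists once, up front. This scan costs n*n, but it is paid
# a single time, while the multiplication is repeated on every iteration.
nonzero = [[j for j, value in enumerate(row) if value != 0] for row in A]

y_dense = dense_matvec(A, x)
y_sparse = sparse_matvec(A, nonzero, x)
print(y_dense == y_sparse)
```

Both functions return the same vector; the sparse version simply skips the multiplications that would have contributed zero.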

Use `numpy`!

But you have another problem: you're trying to do matrix multiplication with pure Python, which (due to type-checking, non-contiguous data structures, etc.) is slow. The solution is to use numpy, which provides fast algorithms and data structures. Then you can use scipy's sparse matrices, as they implement fast sparse matrix-vector multiplication for you.

Experiment

We can show all of this with a quick experiment. First we'll generate a 10,000 x 10,000 dense matrix A:

>>> import numpy as np
>>> n = 10000
>>> A = np.random.sample((n,n))

Then we'll make a sparse matrix B by thresholding A. B is the same size as A, but only 10% of its entries are nonzero:

>>> B = np.where(A < .1, A, 0).astype(float)

Now we'll make a dense vector to multiply A and B with:

>>> x = np.random.sample(n)
>>> %timeit A.dot(x)
10 loops, best of 3: 46.7 ms per loop
>>> %timeit B.dot(x)
10 loops, best of 3: 43.7 ms per loop

It takes the same amount of time to compute Ax as it does to compute Bx, even though B is "sparse". Of course, it isn't really sparse: it's stored as a dense matrix with a lot of zero entries. Let's make it sparse:

>>> import scipy.sparse
>>> sparse_B = scipy.sparse.csr_matrix(B)
>>> %timeit sparse_B.dot(x)
100 loops, best of 3: 12 ms per loop

There's our speedup! Now, just for fun, what if we store A as a sparse matrix, even though it's really dense?

>>> sparse_A = scipy.sparse.csr_matrix(A)
>>> %timeit sparse_A.dot(x)
10 loops, best of 3: 112 ms per loop

Ouch! But this is to be expected, as storing A as a sparse matrix will incur some overhead during the multiplication.
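Applied to the question, the same idea might look like the sketch below. It assumes H is row-normalised (H[i, j] > 0 when page i links to page j) with all-zero rows for dangling pages; the function name pagerank, the uniform starting vector, and the 4-page graph are illustrative assumptions, not the asker's code.

```python
import numpy as np
import scipy.sparse


def pagerank(H, alpha=0.85, k_steps=50):
    """Power iteration pi <- alpha*pi*H + (alpha*pi*a + 1 - alpha)*e/n.

    H is a sparse, row-normalised hyperlink matrix; rows of dangling
    pages are all zero.
    """
    H = scipy.sparse.csr_matrix(H)
    n = H.shape[0]
    # Dangling-node vector: 1.0 where a page has no out-links, else 0.0.
    a = (np.asarray(H.sum(axis=1)).ravel() == 0).astype(float)
    # Transpose once so each step is a single sparse matrix-vector product.
    H_T = H.T.tocsr()
    pi = np.full(n, 1.0 / n)  # assumed starting vector: uniform
    for _ in range(k_steps):
        # This scalar is the same for every page, so compute it once per step.
        shared = (alpha * pi.dot(a) + 1.0 - alpha) / n
        pi = alpha * H_T.dot(pi) + shared
    return pi


# Hypothetical 4-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0; page 3 is dangling.
rows = [0, 0, 1, 2]
cols = [1, 2, 2, 0]
vals = [0.5, 0.5, 1.0, 1.0]
H = scipy.sparse.csr_matrix((vals, (rows, cols)), shape=(4, 4))

pi = pagerank(H)
print(pi, pi.sum())
```

The dense Google matrix G is never formed here: each step costs one sparse product plus a dot product with a, so the work per iteration grows with the number of links rather than with n^2.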

answered Sep 18 '22 23:09

jme

Based on your formula, the calculation with matrix H doesn't look any faster than the one with matrix G.

Explanation:

You might want to take a look at an introduction to Big O notation.

The right-most part of the formula (after the +) consists only of a simple calculation without loops, and its complexity is just O(1), which means it does not depend on the number of URLs you are taking into account.

The calculations for both H and G, on the other hand, seem to be at least O(n^2) (n being the number of URLs).

Edit:

In the most deeply nested part of your code, you have two instructions, one of which is conditional on whether matrix_H[k][j] is 0 or not. Even when it is 0, which will be the case most of the time if H is a sparse matrix, the second instruction is still executed. On top of that, you go through every iteration of the loop regardless.

That still gives you a complexity of O(n^2), so iterating over H is not (much) faster than iterating over G.
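One way to get around both points in pure Python is sketched below: compute the dangling-node term once per iteration instead of once per j, and loop only over the nonzero entries of each column of H. The in_links layout, the function name pagerank_step, and the 3-page example are hypothetical; matrix_H, vector_a and alpha follow the names used in the question.

```python
def pagerank_step(pi_prev, in_links, vector_a, alpha=0.85):
    """One iteration that touches only the nonzero entries of H.

    in_links[j] is a list of (k, H[k][j]) pairs for the nonzero entries of
    column j (a hypothetical layout, built once before iterating).
    """
    n = len(pi_prev)
    # Dangling term: identical for every j, so compute it once per iteration.
    dangling = sum(pi_prev[k] * vector_a[k] for k in range(n))
    shared = (alpha * dangling + (1.0 - alpha)) / n
    pi_next = [0.0] * n
    for j in range(n):
        total = 0.0
        for k, h_kj in in_links[j]:
            total += pi_prev[k] * h_kj
        pi_next[j] = alpha * total + shared
    return pi_next


# Hypothetical 3-page example: 0 -> 1, 0 -> 2, 1 -> 2; page 2 is dangling.
matrix_H = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
vector_a = [0.0, 0.0, 1.0]
n = len(matrix_H)

# Scan H once to record where its nonzero entries are.
in_links = [
    [(k, matrix_H[k][j]) for k in range(n) if matrix_H[k][j] != 0]
    for j in range(n)
]

pi = [1.0 / n] * n
for _ in range(20):
    pi = pagerank_step(pi, in_links, vector_a)
print(pi)
```

With this layout the inner loop runs once per link instead of once per pair of pages, so each iteration costs roughly n plus the number of nonzero entries in H.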

answered Sep 16 '22 23:09

Jivan
