Why is my prof's version of LU decomposition faster than mine? Python numpy

I am attending a course in Numerical Analysis at my university. We are studying LU decomposition. I tried implementing my own version before looking at my lecturer's. I thought mine was pretty fast, but when I actually compared them, my lecturer's version was much faster, even though it uses loops! Why is that?

Lecturer Version

import numpy as np

def LU_decomposition(A):
    """Perform LU decomposition using the Doolittle factorisation."""

    L = np.zeros_like(A, dtype=float)  # float output even if A is an integer array
    U = np.zeros_like(A, dtype=float)
    N = np.size(A, 0)

    for k in range(N):
        L[k, k] = 1
        U[k, k] = (A[k, k] - np.dot(L[k, :k], U[:k, k])) / L[k, k]
        for j in range(k+1, N):
            U[k, j] = (A[k, j] - np.dot(L[k, :k], U[:k, j])) / L[k, k]
        for i in range(k+1, N):
            L[i, k] = (A[i, k] - np.dot(L[i, :k], U[:k, k])) / U[k, k]

    return L, U

My Version

def lu(A, non_zero=1):
    '''
    Given a matrix A, factorizes it into two matrices L and U, where L is
    lower triangular and U is upper triangular. This method implements
    Doolittle's method which sets l_ii = 1, i.e. L is a unit triangular
    matrix.

    :param      A: Matrix to be factorized. NxN
    :type       A: numpy.array

    :param non_zero: Value to which l_ii is assigned. Must be non-zero.
    :type  non_zero: non-zero float.

    :return: (L, U)
    '''
    # Check that the matrix is square
    if A.shape[0] != A.shape[1]:
        raise ValueError('Input argument is not a square matrix.')

    # Store the size of the matrix
    n = A.shape[0]

    # Instantiate two zero matrices NxN (L, U)
    L = np.zeros((n, n), dtype=float)
    U = np.zeros((n, n), dtype=float)

    # Start algorithm
    for k in range(n):
        # Specify non-zero value for l_kk (Doolittle's)
        L[k, k] = non_zero
        # Case k = 0 is trivial
        if k == 0:
            # Complete first row of U
            U[0, :] = A[0, :] / L[0, 0]
            # Complete first column of L
            L[:, 0] = A[:, 0] / U[0, 0]
        # Case k = n-1 is trivial
        elif k == n-1:
            # Obtain  u_nn
            U[-1, -1] = (A[-1, -1] - np.dot(L[-1, :], U[:, -1])) / L[-1, -1]

        else:
            # Obtain u_kk
            U[k, k] = (A[k, k] - np.dot(L[k, :], U[:, k])) / L[k, k]
            # Complete kth row of U
            U[k, k+1:] = (A[k, k+1:] - [np.dot(L[k, :], U[:, i])
                                        for i in range(k+1, n)]) / L[k, k]
            # Complete kth column of L
            L[k+1:, k] = (A[k+1:, k] - [np.dot(L[i, :], U[:, k])
                                        for i in range(k+1, n)]) / U[k, k]
    return L, U

Benchmarking

I used the following commands:

A = np.random.randint(1, 10, size=(4, 4))
%timeit lu(A)
57.5 µs ± 2.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit LU_decomposition(A)
42.1 µs ± 776 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
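
Note that np.random.randint returns an integer matrix, and a 4x4 problem is tiny; a more representative comparison would presumably use larger floating-point input, along these lines:

A = np.random.rand(200, 200)  # float64, large enough for vectorization to pay off
%timeit lu(A)
%timeit LU_decomposition(A)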

And also, how come scipy's version is so much better?

scipy.linalg.lu(A)
6.47 µs ± 219 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
asked Feb 09 '18 by Euler_Salter


2 Answers

Your version has conditionals inside the Python code where the lecturer's does not. The numpy library is highly optimized native code, so anything you can do to push the computation out of Python and into numpy helps make it faster. For example, the whole factorisation can be written with one branch-free slice update per pivot, as in the sketch below.
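
This is only a sketch of that idea (a hypothetical lu_sliced, not your prof's code): each iteration fills a whole row of U and a whole column of L with a single numpy expression, so the k = 0 and k = n-1 special cases disappear.

import numpy as np

def lu_sliced(A):
    """Doolittle LU: one slice-based update per pivot, no branches."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.eye(n)          # unit diagonal, so no division in the U update
    U = np.zeros((n, n))
    for k in range(n):
        # k-th row of U in one vectorized expression
        U[k, k:] = A[k, k:] - L[k, :k] @ U[:k, k:]
        # k-th column of L below the diagonal
        L[k+1:, k] = (A[k+1:, k] - L[k+1:, :k] @ U[:k, k]) / U[k, k]
    return L, U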

Scipy must have an even more optimized version in its library: since the whole operation is a single call, the outer loop is likely part of the optimized native code instead of the relatively slow Python code.
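
For reference, scipy.linalg.lu also pivots for numerical stability, so it returns a permutation matrix as well:

import numpy as np
from scipy.linalg import lu as scipy_lu

A = np.random.rand(4, 4)
P, L, U = scipy_lu(A)
print(np.allclose(A, P @ L @ U))  # True: A factors as P @ L @ U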

You might also try Cython and see what difference compiling the loops to native code makes.
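
As a rough illustration (a hypothetical lu_cython, untested here), a typed Cython version of the same triple loop could look like this in a Jupyter notebook; the static types let the inner loops compile to plain C:

%load_ext cython

%%cython
import numpy as np

def lu_cython(double[:, :] A):
    """Doolittle LU with statically typed loops, compiled by Cython."""
    cdef Py_ssize_t n = A.shape[0]
    cdef double[:, :] L = np.zeros((n, n))
    cdef double[:, :] U = np.zeros((n, n))
    cdef Py_ssize_t i, j, k
    cdef double s
    for k in range(n):
        L[k, k] = 1.0
        for j in range(k, n):          # k-th row of U
            s = 0.0
            for i in range(k):
                s += L[k, i] * U[i, j]
            U[k, j] = A[k, j] - s
        for i in range(k + 1, n):      # k-th column of L
            s = 0.0
            for j in range(k):
                s += L[i, j] * U[j, k]
            L[i, k] = (A[i, k] - s) / U[k, k]
    return np.asarray(L), np.asarray(U)

%timeit lu_cython(A.astype(float))  # memoryview argument must be float64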

answered by Charlie


I think yours is slower because of the intermediary data structures that you use:

  • a python list is created with [np.dot(L[k, :], U[:, i]) for i in range(k+1, n)]
  • a numpy array is created with A[k, k+1:] - temp_list
  • another temporary numpy array is created with temp_ndarray / L[k, k]
  • finally, this temporary array is copied into the result array

For each of these steps the CPU has to execute a loop, even if you didn't write one explicitly: numpy abstracts the loops away, but they still run. Trading one Python loop for several fast implicit numpy loops often pays off, but only for medium-sized and larger arrays; on a 4x4 matrix the temporaries dominate. Also, a list comprehension really is only marginally faster than a regular for-loop. Each of your comprehensions can be collapsed into a single matrix-vector product, as in the sketch below.
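
A sketch against your else-branch (same results, fewer temporaries); the truncated dot products are exact because the trailing entries of L[k, :] and U[:, k] are still zero at step k:

            # one vectorized update each; no Python list of per-column dots
            U[k, k+1:] = (A[k, k+1:] - L[k, :k] @ U[:k, k+1:]) / L[k, k]
            L[k+1:, k] = (A[k+1:, k] - L[k+1:, :k] @ U[:k, k]) / U[k, k]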

The scipy one is faster because it's a highly optimized implementation in a low-level programming language (it ultimately calls LAPACK routines), whereas Python is a very high-level language. In the end, what this probably means is that you should appreciate your prof's code for its elegance and readability, not for its speed :)

answered by user7138814