I have a square matrix S (160 x 160), and a huge matrix X (160 x 250000). Both are dense numpy arrays.
My goal: find Q such that Q = inv(chol(S)) * X, where chol(S) is the lower Cholesky factor of S.
Naturally, a simple solution is
import scipy.linalg
cholS = scipy.linalg.cholesky(S, lower=True)  # lower Cholesky factor of S
Q = scipy.linalg.solve(cholS, X)              # solve cholS * Q = X
My problem: this solution is noticeably slower (more than 2x) in Python than when I try the same in Matlab. Here are some timing experiments:
timeit np.linalg.solve( cholS, X)
1 loops, best of 3: 1.63 s per loop
timeit scipy.linalg.solve_triangular( cholS, X, lower=True)
1 loops, best of 3: 2.19 s per loop
timeit scipy.linalg.solve( cholS, X)
1 loops, best of 3: 2.81 s per loop
[Matlab]
cholS \ X
0.675 s
[Matlab, single thread via -singleCompThread]
cholS \ X
1.26 s
Basically, I'd like to know: (1) can I reach Matlab-like speeds in Python? and (2) why is the scipy version so slow?
The solver should be able to take advantage of the fact that chol(S) is triangular. However, numpy.linalg.solve() is faster than scipy.linalg.solve_triangular(), even though the numpy call doesn't use the triangular structure at all. What gives? Matlab's backslash operator seems to auto-detect when my matrix is triangular, but the Python routines apparently do not.
I'd be happy to use a custom call to BLAS/LAPACK routines for solving triangular linear systems, but I really don't want to write that code myself.
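For what it's worth, scipy does expose the relevant LAPACK routine (?trtrs, the dedicated triangular solver), so a custom call doesn't mean writing low-level code yourself. A minimal sketch, assuming your scipy build can resolve trtrs through scipy.linalg.get_lapack_funcs, and reusing cholS and X from above:
import scipy.linalg
# resolve the LAPACK triangular solver matched to the dtypes (dtrtrs for float64)
trtrs, = scipy.linalg.get_lapack_funcs(('trtrs',), (cholS, X))
# solve cholS * Q = X, reading only the lower triangle of cholS
Q, info = trtrs(cholS, X, lower=1)
if info != 0:
    raise RuntimeError("trtrs failed with info=%d" % info)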
For reference, I'm using scipy version 0.11.0 and the Enthought Python distribution (which links against Intel's MKL for its linear algebra), so I think I should be able to reach Matlab-like speeds.
Why not just use the equation directly: Q = inv(chol(S)) * X? Here is my test:
import scipy.linalg
import numpy as np

N = 160
M = 100000
S = np.random.randn(N, N)
B = np.random.randn(N, M)
S = np.dot(S, S.T)                         # make S symmetric positive definite
cS = scipy.linalg.cholesky(S, lower=True)  # lower Cholesky factor
Y1 = scipy.linalg.solve(cS, B)             # solve cS * Y1 = B
icS = scipy.linalg.inv(cS)                 # explicit inverse of the factor
Y2 = np.dot(icS, B)                        # Y2 = inv(cS) * B
np.allclose(Y1, Y2)
output:
True
Here is the time test:
%time scipy.linalg.solve(cS, B)
%time np.linalg.solve(cS, B)
%time scipy.linalg.solve_triangular(cS, B, lower=True)
%time icS = scipy.linalg.inv(cS); np.dot(icS, B)
output:
CPU times: user 2.07 s, sys: 0.00 s, total: 2.07 s
Wall time: 2.08 s
CPU times: user 1.93 s, sys: 0.00 s, total: 1.93 s
Wall time: 1.92 s
CPU times: user 1.12 s, sys: 0.00 s, total: 1.12 s
Wall time: 1.13 s
CPU times: user 0.71 s, sys: 0.00 s, total: 0.71 s
Wall time: 0.72 s
I don't know why scipy.linalg.solve_triangular is slower than numpy.linalg.solve on your system, but the inv version is the fastest. (One caveat: forming an explicit inverse is generally a little less numerically stable than a triangular solve, though for a well-conditioned Cholesky factor the difference is usually negligible.)
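A plausible explanation for why the inv version wins: once inv(cS) has been formed, applying it to B is one dense matrix-matrix multiply (a BLAS-3 gemm call), which is usually the most heavily optimized kernel in any BLAS. A small sketch, reusing the variables from the test above, that also exploits the triangular structure when forming the inverse:
# form inv(cS) with a triangular solve against the identity,
# then apply it to B with a single dense multiply
icS = scipy.linalg.solve_triangular(cS, np.eye(N), lower=True)
Y3 = np.dot(icS, B)  # same result as solve_triangular(cS, B, lower=True)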
TL;DR: Don't use numpy's or scipy's solve when you have a triangular system; just use scipy.linalg.solve_triangular with at least the check_finite=False keyword argument for fast and non-destructive solutions.
I found this thread after stumbling across some discrepancies between numpy.linalg.solve and scipy.linalg.solve (and scipy's lu_solve, etc.). I don't have Enthought's MKL-based Numpy/Scipy, but I hope my findings can help you in some way.
With the pre-built binaries for Numpy and Scipy (32-bit, running on Windows 7):
I see a significant difference between numpy.linalg.solve and scipy.linalg.solve when solving for a vector X (i.e., X is 160 by 1). Scipy's runtime is 1.23x numpy's, which I think is substantial.
However, most of the difference appears to be due to scipy's solve checking for invalid entries. When passing check_finite=False into scipy.linalg.solve, scipy's runtime is 1.02x numpy's.
Scipy's solve with destructive updates, i.e., with overwrite_a=True, overwrite_b=True, is slightly faster than numpy's solve (which is non-destructive): numpy's solve runtime is 1.021x destructive scipy.linalg.solve's, and scipy with just check_finite=False has runtime 1.04x the destructive case. In summary, destructive scipy.linalg.solve is very slightly faster than either of these cases.
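For concreteness, the flag combinations discussed above look like this (a sketch on a generic system A, b):
import numpy as np
import scipy.linalg

A = np.random.randn(160, 160)
b = np.random.randn(160, 1)
# skip only the NaN/Inf input scan
x = scipy.linalg.solve(A, b, check_finite=False)
# additionally let LAPACK work in place; the contents of A and b are destroyed
x = scipy.linalg.solve(A, b, check_finite=False, overwrite_a=True, overwrite_b=True)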
The above are for a vector X. If I make X a wide array, specifically 160 by 10000, scipy.linalg.solve with check_finite=False is essentially as fast as with check_finite=False, overwrite_a=True, overwrite_b=True. Scipy's solve (without any special keywords) has runtime 1.09x this "unsafe" (check_finite=False) call. Numpy's solve has runtime 1.03x scipy's fastest for this array X case.
scipy.linalg.solve_triangular delivers significant speedups in both these cases, but you have to turn off input checking, i.e., pass in check_finite=False. With that flag, solve_triangular was the clear winner: the fastest solve had 5.68x its runtime for the vector X case and 1.76x its runtime for the array X case.
solve_triangular with destructive computation (overwrite_b=True) gives you no speedup on top of check_finite=False (and actually hurts slightly for the array X case).
I, ignoramus, was previously unaware of solve_triangular and was using scipy.linalg.lu_solve as a triangular solver, i.e., instead of solve_triangular(cholS, X) doing lu_solve((cholS, numpy.arange(160)), X) (both produce the same answer). But I discovered that lu_solve used in this way has runtime 1.07x unsafe solve_triangular's for the vector X case, while its runtime was 1.76x for the array X case. I'm not sure why lu_solve is so much slower for array X compared to vector X, but the lesson is: use solve_triangular (with check_finite=False).
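A note on why the lu_solve trick works: lu_solve treats its first argument as LAPACK's packed LU factors, with a unit-diagonal L stored below the diagonal and U stored on and above it. For an upper-triangular matrix (which is what the benchmark below uses, via np.triu, and what scipy.linalg.cholesky returns by default), the packed L is just the identity, so lu_solve with identity pivots reduces to a plain triangular solve; for a lower-triangular matrix the packed-LU reading differs, so the identification only holds in the upper-triangular orientation. A minimal sketch of the equivalence (the shifted diagonal just keeps the random matrix well conditioned):
import numpy as np
import scipy.linalg as sla

A = np.triu(np.random.randn(160, 160)) + 10 * np.eye(160)  # upper triangular
b = np.random.randn(160, 1)
piv = np.arange(160)  # identity pivots: no row swaps

x_lu = sla.lu_solve((A, piv), b)    # A read as packed LU: L = I, U = A
x_tri = sla.solve_triangular(A, b)  # default is an upper-triangular solve
assert np.allclose(x_lu, x_tri)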
Copying the data to Fortran format didn't seem to matter at all. Neither does converting to numpy.matrix.
I might as well compare my non-MKL Python libraries against single-threaded (maxNumCompThreads=1) Matlab 2013a. The fastest Python implementations above had 4.5x longer runtime for the vector X case and 6.3x longer runtime for the fat matrix X case.
However, here's the Python script I used to benchmark these; perhaps someone with MKL-accelerated Numpy/Scipy can post their numbers. Note that I just comment out the line n = 10000 to disable the fat-matrix X case and do the n=1 vector case. (Sorry.)
import scipy.linalg as sla
import numpy.linalg as nla
from numpy.random import RandomState
from timeit import timeit
import numpy as np

RNG = RandomState(69)

m = 160
n = 1
#n = 10000
Ac = RNG.randn(m, m)
if 1:
    Ac = np.triu(Ac)  # make the system upper triangular
bc = RNG.randn(m, n)
Af = Ac.copy("F")  # Fortran-ordered copies
bf = bc.copy("F")

if 0:  # Save to Matlab format
    import scipy.io as io
    io.savemat("b_%d.mat" % (n,), dict(A=Ac, b=bc))
    import sys
    sys.exit(0)

def lapper(fn, source, **kwargs):
    # copy the inputs each lap so destructive solvers can't contaminate later runs
    Alocal = source[0].copy()
    blocal = source[1].copy()
    fn(Alocal, blocal, **kwargs)

laps = 1000 if n <= 1 else 100

def printer(t, s=''):
    print("%g seconds, %d laps, %s" % (t / float(laps), laps, s))
    return t / float(laps)

t = []
print("C")
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac, bc)), number=laps),
                 "scipy.solve"))
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac, bc), check_finite=False),
                        number=laps), "scipy.solve, infinite-ok"))
t.append(printer(timeit(lambda: lapper(nla.solve, (Ac, bc)), number=laps),
                 "numpy.solve"))
#print("F")  # Doesn't seem to matter
#printer(timeit(lambda: lapper(sla.solve, (Af, bf)), number=laps))
#printer(timeit(lambda: lapper(nla.solve, (Af, bf)), number=laps))
print("sla with tweaks")
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac, bc), overwrite_a=True,
                                       overwrite_b=True, check_finite=False),
                        number=laps), "scipy.solve destructive"))
print("Tri")
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac, bc)),
                        number=laps), "scipy.solve_triangular"))
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac, bc),
                                       check_finite=False), number=laps),
                 "scipy.solve_triangular, inf-ok"))
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac, bc),
                                       overwrite_b=True, check_finite=False),
                        number=laps), "scipy.solve_triangular destructive"))
print("LU")
piv = np.arange(m)
t.append(printer(timeit(lambda: lapper(
    lambda X, b: sla.lu_solve((X, piv), b, check_finite=False),
    (Ac, bc)), number=laps), "LU"))
print("all times:")
print(t)
Output of the above script for the vector case, n=1:
C
0.000739405 seconds, 1000 laps, scipy.solve
0.000624746 seconds, 1000 laps, scipy.solve, infinite-ok
0.000590003 seconds, 1000 laps, numpy.solve
sla with tweaks
0.000608365 seconds, 1000 laps, scipy.solve destructive
Tri
0.000208711 seconds, 1000 laps, scipy.solve_triangular
9.38371e-05 seconds, 1000 laps, scipy.solve_triangular, inf-ok
9.37682e-05 seconds, 1000 laps, scipy.solve_triangular destructive
LU
0.000100215 seconds, 1000 laps, LU
all times:
[0.0007394047886284343, 0.00062474593940593, 0.0005900030818282472, 0.0006083650710913095, 0.00020871054023307778, 9.383710445114923e-05, 9.37682389063692e-05, 0.00010021534750467032]
Output of the above script for the matrix case, n=10000:
C
0.118985 seconds, 100 laps, scipy.solve
0.113687 seconds, 100 laps, scipy.solve, infinite-ok
0.115569 seconds, 100 laps, numpy.solve
sla with tweaks
0.113122 seconds, 100 laps, scipy.solve destructive
Tri
0.0725959 seconds, 100 laps, scipy.solve_triangular
0.0634396 seconds, 100 laps, scipy.solve_triangular, inf-ok
0.0638423 seconds, 100 laps, scipy.solve_triangular destructive
LU
0.1115 seconds, 100 laps, LU
all times:
[0.11898513112988955, 0.11368747217793944, 0.11556863916356903, 0.11312182352918797, 0.07259593807427585, 0.0634396208970783, 0.06384230931663318, 0.11150022257648459]
Note that the above Python script can save its arrays as Matlab .MAT data files. This is currently disabled (if 0, sorry), but if enabled, you can test Matlab's speed on the exact same data. Here's a timing script for Matlab:
clear
q = load('b_10000.mat');
A=q.A;
b=q.b;
clear q
matrix_time = timeit(@() A\b)
q = load('b_1.mat');
A=q.A;
b=q.b;
clear q
vector_time = timeit(@() A\b)
You'll need the timeit function from the Mathworks File Exchange: http://www.mathworks.com/matlabcentral/fileexchange/18798-timeit-benchmarking-function. This produces the following output:
matrix_time =
0.0099989
vector_time =
2.2487e-05
The upshot of this empirical analysis is, in Python at least: don't use numpy's or scipy's solve when you have a triangular system; just use scipy.linalg.solve_triangular, with at least the check_finite=False keyword argument, for fast and non-destructive solutions.
A couple of things to try:
# use Fortran-order arrays, so that a copy is avoided
X = X.copy('F')
# avoid another copy, but trash the contents of X
Y = solve_triangular(cholS, X, overwrite_b=True)
# Scipy >= 0.12 only --- but doesn't seem to have a large effect on speed...
Y = solve_triangular(cholS, X, check_finite=False)
With both of these, it should be pretty much equivalent to a direct call to MKL with no buffer copies.
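Put together, a sketch of the resulting "fast path" call, under the assumption that X may be destroyed (and Scipy >= 0.12 for check_finite):
X = X.copy('F')  # Fortran order avoids one internal copy
Y = solve_triangular(cholS, X, lower=True,
                     overwrite_b=True,    # reuse X's buffer; X's contents are trashed
                     check_finite=False)  # skip the NaN/Inf input scan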
I can't reproduce the issue with np.linalg.solve and scipy.linalg.solve having different speeds --- with the BLAS + LAPACK combination I have, both seem the same speed.