I have a square matrix S (160 x 160), and a huge matrix X (160 x 250000). Both are dense numpy arrays.
My goal: find Q such that Q = inv(chol(S)) * X, where chol(S) is the lower Cholesky factor of S.
Naturally, a simple solution is
import scipy.linalg
cholS = scipy.linalg.cholesky(S, lower=True)  # lower Cholesky factor of S
Q = scipy.linalg.solve(cholS, X)              # solve cholS * Q = X
My problem: this solution is noticeably slower (more than 2x) in Python than when I try the same in Matlab. Here are some timing experiments:
timeit np.linalg.solve( cholS, X)
1 loops, best of 3: 1.63 s per loop
timeit scipy.linalg.solve_triangular( cholS, X, lower=True)
1 loops, best of 3: 2.19 s per loop
timeit scipy.linalg.solve( cholS, X)
1 loops, best of 3: 2.81 s per loop
[Matlab]
cholS \ X
0.675 s
[Matlab, single thread via -singleCompThread]
cholS \ X
1.26 s
Basically, I'd like to know: (1) can I reach Matlab-like speeds in Python? and (2) why is the scipy version so slow?
The solver should be able to take advantage of the fact that chol(S) is triangular. However, numpy.linalg.solve() is faster than scipy.linalg.solve_triangular(), even though the numpy call doesn't use the triangular structure at all. What gives? Matlab's backslash operator seems to auto-detect when my matrix is triangular, but the Python routines apparently do not.
I'd be happy to use a custom call to BLAS/LAPACK routines for solving triangular linear systems, but I really don't want to write that code myself.
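For what it's worth, scipy does expose the relevant LAPACK routine (?trtrs, the dedicated triangular solver), so a custom call doesn't mean writing low-level code yourself. A minimal sketch, assuming your scipy build can resolve trtrs through scipy.linalg.get_lapack_funcs, and reusing cholS and X from above:
import scipy.linalg
# resolve the LAPACK triangular solver matched to the dtypes (dtrtrs for float64)
trtrs, = scipy.linalg.get_lapack_funcs(('trtrs',), (cholS, X))
# solve cholS * Q = X, reading only the lower triangle of cholS
Q, info = trtrs(cholS, X, lower=1)
if info != 0:
    raise RuntimeError("trtrs failed with info=%d" % info)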
For reference, I'm using scipy version 0.11.0 and the Enthought Python distribution (which links against Intel's MKL for its linear algebra), so I think I should be able to reach Matlab-like speeds.
Why not just use the equation directly: Q = inv(chol(S)) * X? Here is my test:
import scipy.linalg
import numpy as np

N = 160
M = 100000
S = np.random.randn(N, N)
B = np.random.randn(N, M)
S = np.dot(S, S.T)                         # make S symmetric positive definite
cS = scipy.linalg.cholesky(S, lower=True)  # lower Cholesky factor
Y1 = scipy.linalg.solve(cS, B)             # solve cS * Y1 = B
icS = scipy.linalg.inv(cS)                 # explicit inverse of the factor
Y2 = np.dot(icS, B)                        # Y2 = inv(cS) * B
np.allclose(Y1, Y2)
output:
True
Here is the time test:
%time scipy.linalg.solve(cS, B)
%time np.linalg.solve(cS, B)
%time scipy.linalg.solve_triangular(cS, B, lower=True)
%time icS = scipy.linalg.inv(cS); np.dot(icS, B)
output:
CPU times: user 2.07 s, sys: 0.00 s, total: 2.07 s
Wall time: 2.08 s
CPU times: user 1.93 s, sys: 0.00 s, total: 1.93 s
Wall time: 1.92 s
CPU times: user 1.12 s, sys: 0.00 s, total: 1.12 s
Wall time: 1.13 s
CPU times: user 0.71 s, sys: 0.00 s, total: 0.71 s
Wall time: 0.72 s
I don't know why scipy.linalg.solve_triangular is slower than numpy.linalg.solve on your system, but the inv version is the fastest. (One caveat: forming an explicit inverse is generally a little less numerically stable than a triangular solve, though for a well-conditioned Cholesky factor the difference is usually negligible.)
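A plausible explanation for why the inv version wins: once inv(cS) has been formed, applying it to B is one dense matrix-matrix multiply (a BLAS-3 gemm call), which is usually the most heavily optimized kernel in any BLAS. A small sketch, reusing the variables from the test above, that also exploits the triangular structure when forming the inverse:
# form inv(cS) with a triangular solve against the identity,
# then apply it to B with a single dense multiply
icS = scipy.linalg.solve_triangular(cS, np.eye(N), lower=True)
Y3 = np.dot(icS, B)  # same result as solve_triangular(cS, B, lower=True)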
TL;DR: Don't use numpy's or scipy's solve when you have a triangular system; just use scipy.linalg.solve_triangular with at least the check_finite=False keyword argument for fast and non-destructive solutions.
I found this thread after stumbling across some discrepancies between numpy.linalg.solve and scipy.linalg.solve (and scipy's lu_solve, etc.). I don't have Enthought's MKL-based Numpy/Scipy, but I hope my findings can help you in some way.
With the pre-built binaries for Numpy and Scipy (32-bit, running on Windows 7):
I see a significant difference between numpy.linalg.solve and scipy.linalg.solve when solving for a vector X (i.e., X is 160 by 1). Scipy's runtime is 1.23x numpy's, which I think is substantial.
However, most of the difference appears to be due to scipy's solve checking for invalid entries. When passing check_finite=False into scipy.linalg.solve, scipy's runtime is 1.02x numpy's.
Scipy's solve with destructive updates, i.e., with overwrite_a=True, overwrite_b=True, is slightly faster than numpy's solve (which is non-destructive): numpy's solve runtime is 1.021x destructive scipy.linalg.solve's, and scipy with just check_finite=False has runtime 1.04x the destructive case. In summary, destructive scipy.linalg.solve is very slightly faster than either of these cases.
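For concreteness, the flag combinations discussed above look like this (a sketch on a generic system A, b):
import numpy as np
import scipy.linalg

A = np.random.randn(160, 160)
b = np.random.randn(160, 1)
# skip only the NaN/Inf input scan
x = scipy.linalg.solve(A, b, check_finite=False)
# additionally let LAPACK work in place; the contents of A and b are destroyed
x = scipy.linalg.solve(A, b, check_finite=False, overwrite_a=True, overwrite_b=True)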
The above are for a vector X. If I make X a wide array, specifically 160 by 10000, scipy.linalg.solve with check_finite=False is essentially as fast as with check_finite=False, overwrite_a=True, overwrite_b=True. Scipy's solve (without any special keywords) has runtime 1.09x this "unsafe" (check_finite=False) call. Numpy's solve has runtime 1.03x scipy's fastest for this array X case.
scipy.linalg.solve_triangular delivers significant speedups in both these cases, but you have to turn off input checking, i.e., pass in check_finite=False. With that flag, solve_triangular was the clear winner: the fastest solve had 5.68x its runtime for the vector X case and 1.76x its runtime for the array X case.
solve_triangular with destructive computation (overwrite_b=True) gives you no speedup on top of check_finite=False (and actually hurts slightly for the array X case).
I, ignoramus, was previously unaware of solve_triangular and was using scipy.linalg.lu_solve as a triangular solver, i.e., instead of solve_triangular(cholS, X) doing lu_solve((cholS, numpy.arange(160)), X) (both produce the same answer). But I discovered that lu_solve used in this way has runtime 1.07x unsafe solve_triangular's for the vector X case, while its runtime was 1.76x for the array X case. I'm not sure why lu_solve is so much slower for array X compared to vector X, but the lesson is: use solve_triangular (with check_finite=False).
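A note on why the lu_solve trick works: lu_solve treats its first argument as LAPACK's packed LU factors, with a unit-diagonal L stored below the diagonal and U stored on and above it. For an upper-triangular matrix (which is what the benchmark below uses, via np.triu, and what scipy.linalg.cholesky returns by default), the packed L is just the identity, so lu_solve with identity pivots reduces to a plain triangular solve; for a lower-triangular matrix the packed-LU reading differs, so the identification only holds in the upper-triangular orientation. A minimal sketch of the equivalence (the shifted diagonal just keeps the random matrix well conditioned):
import numpy as np
import scipy.linalg as sla

A = np.triu(np.random.randn(160, 160)) + 10 * np.eye(160)  # upper triangular
b = np.random.randn(160, 1)
piv = np.arange(160)  # identity pivots: no row swaps

x_lu = sla.lu_solve((A, piv), b)    # A read as packed LU: L = I, U = A
x_tri = sla.solve_triangular(A, b)  # default is an upper-triangular solve
assert np.allclose(x_lu, x_tri)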
Copying the data to Fortran format didn't seem to matter at all. Neither does converting to numpy.matrix.
I might as well compare my non-MKL Python libraries against single-threaded (maxNumCompThreads=1) Matlab 2013a. The fastest Python implementations above had 4.5x longer runtime for the vector X case and 6.3x longer runtime for the fat matrix X case.
However, here's the Python script I used to benchmark these; perhaps someone with MKL-accelerated Numpy/Scipy can post their numbers. Note that I just comment out the line n = 10000 to disable the fat-matrix X case and do the n=1 vector case. (Sorry.)
import scipy.linalg as sla
import numpy.linalg as nla
from numpy.random import RandomState
from timeit import timeit
import numpy as np

RNG = RandomState(69)

m = 160
n = 1
#n = 10000
Ac = RNG.randn(m, m)
if 1:
    Ac = np.triu(Ac)  # make the system upper triangular
bc = RNG.randn(m, n)
Af = Ac.copy("F")  # Fortran-ordered copies
bf = bc.copy("F")

if 0:  # Save to Matlab format
    import scipy.io as io
    io.savemat("b_%d.mat" % (n,), dict(A=Ac, b=bc))
    import sys
    sys.exit(0)

def lapper(fn, source, **kwargs):
    # copy the inputs each lap so destructive solvers can't contaminate later runs
    Alocal = source[0].copy()
    blocal = source[1].copy()
    fn(Alocal, blocal, **kwargs)

laps = 1000 if n <= 1 else 100

def printer(t, s=''):
    print("%g seconds, %d laps, %s" % (t / float(laps), laps, s))
    return t / float(laps)

t = []
print("C")
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac, bc)), number=laps),
                 "scipy.solve"))
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac, bc), check_finite=False),
                        number=laps), "scipy.solve, infinite-ok"))
t.append(printer(timeit(lambda: lapper(nla.solve, (Ac, bc)), number=laps),
                 "numpy.solve"))
#print("F")  # Doesn't seem to matter
#printer(timeit(lambda: lapper(sla.solve, (Af, bf)), number=laps))
#printer(timeit(lambda: lapper(nla.solve, (Af, bf)), number=laps))
print("sla with tweaks")
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac, bc), overwrite_a=True,
                                       overwrite_b=True, check_finite=False),
                        number=laps), "scipy.solve destructive"))
print("Tri")
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac, bc)),
                        number=laps), "scipy.solve_triangular"))
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac, bc),
                                       check_finite=False), number=laps),
                 "scipy.solve_triangular, inf-ok"))
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac, bc),
                                       overwrite_b=True, check_finite=False),
                        number=laps), "scipy.solve_triangular destructive"))
print("LU")
piv = np.arange(m)
t.append(printer(timeit(lambda: lapper(
    lambda X, b: sla.lu_solve((X, piv), b, check_finite=False),
    (Ac, bc)), number=laps), "LU"))
print("all times:")
print(t)
Output of the above script for the vector case, n=1:
C
0.000739405 seconds, 1000 laps, scipy.solve
0.000624746 seconds, 1000 laps, scipy.solve, infinite-ok
0.000590003 seconds, 1000 laps, numpy.solve
sla with tweaks
0.000608365 seconds, 1000 laps, scipy.solve destructive
Tri
0.000208711 seconds, 1000 laps, scipy.solve_triangular
9.38371e-05 seconds, 1000 laps, scipy.solve_triangular, inf-ok
9.37682e-05 seconds, 1000 laps, scipy.solve_triangular destructive
LU
0.000100215 seconds, 1000 laps, LU
all times:
[0.0007394047886284343, 0.00062474593940593, 0.0005900030818282472, 0.0006083650710913095, 0.00020871054023307778, 9.383710445114923e-05, 9.37682389063692e-05, 0.00010021534750467032]
Output of the above script for the matrix case, n=10000:
C
0.118985 seconds, 100 laps, scipy.solve
0.113687 seconds, 100 laps, scipy.solve, infinite-ok
0.115569 seconds, 100 laps, numpy.solve
sla with tweaks
0.113122 seconds, 100 laps, scipy.solve destructive
Tri
0.0725959 seconds, 100 laps, scipy.solve_triangular
0.0634396 seconds, 100 laps, scipy.solve_triangular, inf-ok
0.0638423 seconds, 100 laps, scipy.solve_triangular destructive
LU
0.1115 seconds, 100 laps, LU
all times:
[0.11898513112988955, 0.11368747217793944, 0.11556863916356903, 0.11312182352918797, 0.07259593807427585, 0.0634396208970783, 0.06384230931663318, 0.11150022257648459]
Note that the above Python script can save its arrays as Matlab .MAT data files. This is currently disabled (if 0, sorry), but if enabled, you can test Matlab's speed on the exact same data. Here's a timing script for Matlab:
clear
q = load('b_10000.mat');
A=q.A;
b=q.b;
clear q
matrix_time = timeit(@() A\b)
q = load('b_1.mat');
A=q.A;
b=q.b;
clear q
vector_time = timeit(@() A\b)
You'll need the timeit function from the Mathworks File Exchange: http://www.mathworks.com/matlabcentral/fileexchange/18798-timeit-benchmarking-function. This produces the following output:
matrix_time =
0.0099989
vector_time =
2.2487e-05
The upshot of this empirical analysis is, in Python at least: don't use numpy's or scipy's solve when you have a triangular system; just use scipy.linalg.solve_triangular, with at least the check_finite=False keyword argument, for fast and non-destructive solutions.
A couple of things to try:
# use Fortran-order arrays, so that a copy is avoided
X = X.copy('F')
# avoid another copy, but trash the contents of X
Y = solve_triangular(cholS, X, overwrite_b=True)
# Scipy >= 0.12 only --- but doesn't seem to have a large effect on speed...
Y = solve_triangular(cholS, X, check_finite=False)
With both of these, it should be pretty much equivalent to a direct call to MKL with no buffer copies.
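Put together, a sketch of the resulting "fast path" call, under the assumption that X may be destroyed (and Scipy >= 0.12 for check_finite):
X = X.copy('F')  # Fortran order avoids one internal copy
Y = solve_triangular(cholS, X, lower=True,
                     overwrite_b=True,    # reuse X's buffer; X's contents are trashed
                     check_finite=False)  # skip the NaN/Inf input scan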
I can't reproduce the issue with np.linalg.solve and scipy.linalg.solve having different speeds --- with the BLAS + LAPACK combination I have, both seem the same speed.