
Is there a way to further improve sparse solution times using python?

I have been trying different sparse solvers available in Python 3 and comparing their performance against each other and against Octave and Matlab. I have chosen both direct and iterative approaches; I will explain this in more detail below.

To generate a proper sparse matrix with a banded structure, a Poisson problem is solved using finite elements on square grids of N=250, N=500 and N=1000. This results in a matrix A of dimension N^2 x N^2 and a vector b of dimension N^2 x 1, i.e., the largest system has a million unknowns. If one is interested in replicating my results, I have uploaded the matrices A and the vectors b in the following link (it will expire in 30 days): Get systems used here. The matrices are stored as triplets I,J,V, i.e. the first two columns are the row and column indices, respectively, and the third column holds the values corresponding to those indices. Note that some values in V are nearly zero; these are left in on purpose. Still, the banded structure is confirmed by a "spy" plot of the matrix in both Matlab and Python.

For comparison, I have used the following solvers:

Matlab and Octave, direct solver: The canonical x=A\b.

Matlab and Octave, pcg solver: The preconditioned conjugate gradient solver, pcg(A,b,1e-5,size(b,1)) (no preconditioner is used).

Scipy (Python), direct solver: linalg.spsolve(A, b) where A is previously formatted in csr_matrix format.

Scipy (Python), pcg solver: sp.linalg.cg(A, b, x0=None, tol=1e-05)

Scipy (Python), UMFPACK solver: spsolve(A, b) using from scikits.umfpack import spsolve. This solver is apparently available (only?) under Linux, since it makes use of SuiteSparse [Timothy Davis, Texas A&M]. On Ubuntu, this has to be installed first with sudo apt-get install libsuitesparse-dev. A minimal usage sketch is given right after this list.
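For reference, here is a minimal sketch of calling UMFPACK from Python, using a small 1D Poisson matrix as a stand-in for the real system; the use_solver route is, as far as I understand, how scipy's own spsolve can be told to dispatch to UMFPACK when scikit-umfpack is installed:

import numpy as np
from scipy import sparse as sp
from scipy.sparse import linalg as spla
from scikits.umfpack import spsolve as umf_spsolve  # needs libsuitesparse-dev + scikit-umfpack

# Small 1D Poisson matrix as a stand-in for the real FE system
A = sp.csc_matrix(sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(1000, 1000)))
b = np.ones(1000)

# Route 1: call the scikit-umfpack wrapper directly (CSC is the preferred format)
x_umf = umf_spsolve(A, b)

# Route 2 (as far as I understand): let scipy's own spsolve dispatch to UMFPACK
spla.use_solver(useUmfpack=True)
x_scipy = spla.spsolve(A, b)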

Furthermore, the aforementioned python solvers are tested in:

  1. Windows.
  2. Linux.
  3. Mac OS.

Conditions:

  • Timing is done right before and after the solution of the systems, i.e., the overhead for reading the matrices is not included.
  • Timing is done ten times for each system and an average and a standard deviation are computed (a small sketch of this summary step follows this list).
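The averaging step itself is trivial; roughly, for each saved timing file (names as written by the scripts further down):

import numpy as np

t = np.loadtxt('pythonlinux_n1M_tsolvedirect.txt')   # one timing per run
print('direct solve: mean = %.3f s, std = %.3f s' % (t.mean(), t.std()))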

Hardware:

  • Windows and Linux: Dell laptop, Intel(R) Core(TM) i7-8850H CPU @ 2.60 GHz, 32 GB DDR4 RAM.
  • Mac OS: MacBook Pro Retina (mid 2014), Intel(R) quad-core i7 @ 2.2 GHz, 16 GB DDR3 RAM.

Results: Time to solve vs problem size

Observations:

  • Matlab A\b is the fastest despite running on an older computer.
  • There are notable differences between the Linux and Windows versions. See for instance the direct solver at NxN=1e6. This is despite the Linux version running under Windows (WSL).
  • There can be a huge scatter in the Scipy solvers: if the same solution is run several times, one of the runs can take more than twice as long (a small timing sketch to mitigate this follows the list).
  • The fastest option in Python can be nearly four times slower than Matlab running on more limited hardware. Really?
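A small sketch of how the scatter can be reduced: repeat each solve and keep the minimum wall-clock time (time.perf_counter is better suited to benchmarking than time.time()). The helper name best_of is just my own; it is not part of the scripts below.

import time
import numpy as np

def best_of(solve, repeats=10):
    # hypothetical helper: time a zero-argument callable several times and
    # keep the minimum, which is less sensitive to OS jitter than the mean
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        solve()
        times.append(time.perf_counter() - t0)
    return min(times), np.mean(times), np.std(times)

# e.g. tmin, tmean, tstd = best_of(lambda: linalg.spsolve(A, b))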

If you want to reproduce the tests, here are two very simple scripts. For Matlab/Octave:

IJS=load('KbN1M.txt');
b=load('FbN1M.txt');

I=IJS(:,1);
J=IJS(:,2);
S=IJS(:,3);

Neval=10;
tsparse=zeros(Neval,1);
tsolve_direct=zeros(Neval,1);
tsolve_sparse=zeros(Neval,1);
tsolve_pcg=zeros(Neval,1);
for i=1:Neval
    tic
    A=sparse(I,J,S);              % assemble the sparse matrix from the triplets
    tsparse(i)=toc;
    tic
    x=A\b;                        % direct solve
    tsolve_direct(i)=toc;
    tic
    x2=pcg(A,b,1e-5,size(b,1));   % conjugate gradient, no preconditioner
    tsolve_pcg(i)=toc;
end

save -ascii octave_n1M_tsparse.txt tsparse
save -ascii octave_n1M_tsolvedirect.txt tsolve_direct
save -ascii octave_n1M_tsolvepcg.txt tsolve_pcg

For python:

import time
from scipy import sparse as sp
from scipy.sparse import linalg
import numpy as np
from scikits.umfpack import spsolve, splu #NEEDS LINUX


b=np.loadtxt('FbN1M.txt')
triplets=np.loadtxt('KbN1M.txt')

I=triplets[:,0]-1   # shift the 1-based (Matlab-style) indices to 0-based
J=triplets[:,1]-1
V=triplets[:,2]

I=I.astype(int)
J=J.astype(int)
NN=int(b.shape[0])

Neval=10
time_sparse=np.zeros((Neval,1))
time_direct=np.zeros((Neval,1))
time_conj=np.zeros((Neval,1))
time_umfpack=np.zeros((Neval,1))
for i in range(Neval):
    t = time.time()
    A=sp.coo_matrix((V, (I, J)), shape=(NN, NN))  # assemble from the triplets
    A=sp.csr_matrix(A)                            # convert COO -> CSR
    time_sparse[i,0]=time.time()-t
    t = time.time()
    x=linalg.spsolve(A, b)                        # scipy direct solver
    time_direct[i,0] = time.time() - t
    t = time.time()
    x2=sp.linalg.cg(A, b, x0=None, tol=1e-05)     # CG, returns (solution, info)
    time_conj[i,0] = time.time() - t
    t = time.time()
    x3 = spsolve(A, b)                            # UMFPACK, ONLY IN LINUX
    time_umfpack[i,0] = time.time() - t

np.savetxt('pythonlinux_n1M_tsparse.txt',time_sparse,fmt='%.18f')
np.savetxt('pythonlinux_n1M_tsolvedirect.txt',time_direct,fmt='%.18f')
np.savetxt('pythonlinux_n1M_tsolvepcg.txt',time_conj,fmt='%.18f')
np.savetxt('pythonlinux_n1M_tsolveumfpack.txt',time_umfpack,fmt='%.18f')
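To check that the different solvers actually return comparable solutions, one can append a quick residual check after the loop, for example:

# relative residuals of the last computed solutions (cg returns (x, info))
for name, xk in [('spsolve', x), ('cg', x2[0]), ('umfpack', x3)]:
    res = np.linalg.norm(A @ xk - b) / np.linalg.norm(b)
    print('%s: relative residual = %.2e' % (name, res))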

Is there a way to further improve sparse solution times using Python, or at least get within a similar order of performance as Matlab? I am open to suggestions using C/C++ or Fortran and a wrapper for Python, but I believe it will not get much better than the UMFPACK choice. Suggestions are very welcome.

P.S. I am aware of previous posts, e.g. "scipy slow sparse matrix solver", "Issues using the scipy.sparse.linalg linear system solvers", and "How to use Numba to speed up sparse linear system solvers in Python that are provided in scipy.sparse.linalg?". But I think none is as comprehensive as this one, highlighting even more issues between operating systems when using Python libraries.

EDIT_1: I added a new plot with results using the QR solver from Intel MKL through a Python wrapper, as suggested in the comments. This is, however, still behind Matlab's performance. To do this, one needs to add:

from sparse_dot_mkl import sparse_qr_solve_mkl

and

sparse_qr_solve_mkl(A.astype(np.float32), b.astype(np.float32))

to the scripts provided in the original post. The .astype(np.float32) cast can be omitted, although performance then gets slightly worse (about 10%) for this system. (Updated plot: MKL solver added, UMFPACK highlighted.)
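Putting the two pieces together, the extra timing block looks roughly like this (it reuses A, b, Neval and the imports from the script above; the output file name is just my own choice following the same pattern):

from sparse_dot_mkl import sparse_qr_solve_mkl

time_mkl=np.zeros((Neval,1))
for i in range(Neval):
    t = time.time()
    # single-precision cast as discussed above; omit .astype(np.float32)
    # to stay in double precision (about 10 % slower for this system)
    x4 = sparse_qr_solve_mkl(A.astype(np.float32), b.astype(np.float32))
    time_mkl[i,0] = time.time() - t

np.savetxt('pythonlinux_n1M_tsolvemkl.txt',time_mkl,fmt='%.18f')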


1 Answer

I will try to answer my own question. To provide an answer, I tried an even more demanding example, with a matrix A of size (N,N), about half a million by half a million, and the corresponding vector b of size (N,1). This matrix, however, is much less sparse (more dense) than the one provided in the question. Stored in ASCII it takes about 1.7 GB, compared to about 0.25 GB for the one in the question (even though the latter has larger dimensions). Its sparsity pattern is shown below.

(Figure: sparsity pattern of the new matrix.)

Then I tried to solve Ax=b again using Matlab, Octave and Python, with the aforementioned direct solvers from scipy, the Intel MKL wrapper, and UMFPACK from Tim Davis. My first surprise is that both Matlab and Octave could solve the system using A\b, which is not necessarily a direct solver, since it chooses the best solver based on the characteristics of the matrix, see Matlab's x=A\b. However, Python's linalg.spsolve, the MKL wrapper and UMFPACK all threw out-of-memory errors on Windows and Linux.

On Mac, linalg.spsolve somehow computed a solution, and although the performance was very poor, it never threw memory errors. I wonder if memory is handled differently depending on the OS; it seems to me that the Mac swapped memory to the hard drive rather than running out of RAM.

The performance of the CG solver in Python was rather poor compared to Matlab. However, the CG solver in Python improves enormously if A=0.5(A+A') is computed first (provided, obviously, that the system is symmetric). Using a preconditioner in Python did not help: I tried sp.linalg.spilu together with sp.linalg.LinearOperator to compute a preconditioner, but the performance was rather poor. In Matlab one can use the incomplete Cholesky decomposition. A rough sketch of what I tried is given below.
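This is roughly what both attempts looked like, assuming A (CSR or CSC) and b are already loaded as in the scripts from the question:

from scipy import sparse as sp
from scipy.sparse import linalg

# symmetrize first (only meaningful if the underlying system is symmetric);
# this is the change that actually paid off for CG here
As = 0.5*(A + A.T)
x_cg, info = linalg.cg(As, b, tol=1e-5)

# ILU preconditioner via spilu + LinearOperator; this did not help for this
# system, but the pattern is shown for completeness
ilu = linalg.spilu(sp.csc_matrix(As))
M = linalg.LinearOperator(As.shape, ilu.solve)
x_pcg, info = linalg.cg(As, b, M=M, tol=1e-5)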

For the out-of-memory problem, the solution was to use an LU decomposition (written here Cholesky-style as A = LL') and solve two nested triangular systems: first y = L\b and then x = L'\y. A minimal sketch follows.
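With scipy this is a factor-once, solve-twice pattern; scipy's own splu uses SuperLU, while scikits.umfpack exposes a splu with a similar interface (the UMFPACK route reported in the timings below):

from scipy import sparse as sp
from scipy.sparse import linalg

# factor once, then solve the two triangular systems by forward/backward
# substitution, as described above
lu = linalg.splu(sp.csc_matrix(A))
x = lu.solve(b)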

Here are the minimum solution times:

Matlab mac, A\b = 294 s.
Matlab mac, PCG (without preconditioner) = 17.9 s.
Matlab mac, PCG (with incomplete Cholesky preconditioner) = 9.8 s.
Scipy mac, direct = 4797 s.
Octave, A\b = 302 s.
Octave, PCG (without preconditioner) = 28.6 s.
Octave, PCG (with incomplete Cholesky preconditioner) = 11.4 s.
Scipy, PCG (without A=0.5(A+A')) = 119 s.
Scipy, PCG (with A=0.5(A+A')) = 12.7 s.
Scipy, LU decomposition using UMFPACK (Linux) = 3.7 s total.

So the answer is YES, there are ways to improve the solution times in scipy. The use of the wrappers for UMFPACK (Linux) or the Intel MKL QR solver is highly recommended, if the memory of the workstation allows it. Otherwise, performing A=0.5(A+A') prior to using the conjugate gradient solver can have a positive effect on solution performance when dealing with symmetric systems. Let me know if someone would be interested in having this new system, so I can upload it.
