I have the compressed sparse column (csc) representation of the n x n lower-triangular matrix A with zeros on the main diagonal, and would like to solve for b in
(A + I)' * x = b
This is the routine I have for computing this:
void backsolve(const int*__restrict__ Lp,
const int*__restrict__ Li,
const double*__restrict__ Lx,
const int n,
double*__restrict__ x) {
for (int i=n-1; i>=0; --i) {
for (int j=Lp[i]; j<Lp[i+1]; ++j) {
x[i] -= Lx[j] * x[Li[j]];
}
}
}
Thus, b
is passed in via the argument x
, and is overwritten by the solution. Lp
, Li
, Lx
are respectively the row, indices, and data pointers in the standard csc representation of sparse matrices. This function is the top hotspot in the program, with the line
x[i] -= Lx[j] * x[Li[j]];
being the bulk of the time spent. Compiling with gcc-8.3 -O3 -mfma -mavx -mavx512f
gives
backsolve(int const*, int const*, double const*, int, double*):
lea eax, [rcx-1]
movsx r11, eax
lea r9, [r8+r11*8]
test eax, eax
js .L9
.L5:
movsx rax, DWORD PTR [rdi+r11*4]
mov r10d, DWORD PTR [rdi+4+r11*4]
cmp eax, r10d
jge .L6
vmovsd xmm0, QWORD PTR [r9]
.L7:
movsx rcx, DWORD PTR [rsi+rax*4]
vmovsd xmm1, QWORD PTR [rdx+rax*8]
add rax, 1
vfnmadd231sd xmm0, xmm1, QWORD PTR [r8+rcx*8]
vmovsd QWORD PTR [r9], xmm0
cmp r10d, eax
jg .L7
.L6:
sub r11, 1
sub r9, 8
test r11d, r11d
jns .L5
ret
.L9:
ret
According to vtune,
vmovsd QWORD PTR [r9], xmm0
is the slowest part. I have almost no experience with assembly, and am at a loss as to how to further diagnose or optimize this operation. I have tried compiling with different flags to enable/disable SSE, FMA, etc, but nothing has worked.
Processor: Xeon Skylake
Question What can I do to optimize this function?
This should depend quite a bit on the exact sparsity pattern of the matrix and the platform being used. I tested a few things with gcc 8.3.0
and compiler flags -O3 -march=native
(which is -march=skylake
on my CPU) on the lower triangle of this matrix of dimension 3006 with 19554 nonzero entries. Hopefully this is somewhat close to your setup, but in any case I hope these can give you an idea of where to start.
For timing I used google/benchmark with this source file. It defines benchBacksolveBaseline
which benchmarks the implementation given in the question and benchBacksolveOptimized
which benchmarks the proposed "optimized" implementations. There is also benchFillRhs
which separately benchmarks the function that is used in both to generate some not completely trivial values for the right hand side. To get the time of the "pure" backsolves, the time that benchFillRhs
takes should be subtracted.
The outer loop in your implementation iterates through the columns backwards, while the inner loop iterates through the current column forwards. Seems like it would be more consistent to iterate through each column backwards as well:
for (int i=n-1; i>=0; --i) {
for (int j=Lp[i+1]-1; j>=Lp[i]; --j) {
x[i] -= Lx[j] * x[Li[j]];
}
}
This barely changes the assembly (https://godbolt.org/z/CBZAT5), but the benchmark timings show a measureable improvement:
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
benchFillRhs 2737 ns 2734 ns 5120000
benchBacksolveBaseline 17412 ns 17421 ns 829630
benchBacksolveOptimized 16046 ns 16040 ns 853333
I assume this is caused by somehow more predictable cache access, but I did not look into it much further.
As A is lower triangular, we have i < Li[j]
. Therefore we know that x[Li[j]]
will not change due to the changes to x[i]
in the inner loop. We can put this knowledge into our implementation by using a temporary variable:
for (int i=n-1; i>=0; --i) {
double xi_temp = x[i];
for (int j=Lp[i+1]-1; j>=Lp[i]; --j) {
xi_temp -= Lx[j] * x[Li[j]];
}
x[i] = xi_temp;
}
This makes gcc 8.3.0
move the store to memory from inside the inner loop to directly after its end (https://godbolt.org/z/vM4gPD). The benchmark for the test matrix on my system shows a small improvement:
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
benchFillRhs 2737 ns 2740 ns 5120000
benchBacksolveBaseline 17410 ns 17418 ns 814545
benchBacksolveOptimized 15155 ns 15147 ns 887129
While clang
already starts unrolling the loop after the first suggested code change, gcc 8.3.0
still has not. So let's give that a try by additionally passing -funroll-loops
.
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
benchFillRhs 2733 ns 2734 ns 5120000
benchBacksolveBaseline 15079 ns 15081 ns 953191
benchBacksolveOptimized 14392 ns 14385 ns 963441
Note that the baseline also improves, as the loop in that implementation is also unrolled. Our optimized version also benefits a bit from loop unrolling, but maybe not as much as we may have liked. Looking into the generated assembly (https://godbolt.org/z/_LJC5f), it seems like gcc
might have gone a little far with 8 unrolls. For my setup, I can in fact do a little better with just one simple manual unroll. So drop the flag -funroll-loops
again and implement the unrolling with something like this:
for (int i=n-1; i>=0; --i) {
const int col_begin = Lp[i];
const int col_end = Lp[i+1];
const bool is_col_nnz_odd = (col_end - col_begin) & 1;
double xi_temp = x[i];
int j = col_end - 1;
if (is_col_nnz_odd) {
xi_temp -= Lx[j] * x[Li[j]];
--j;
}
for (; j >= col_begin; j -= 2) {
xi_temp -= Lx[j - 0] * x[Li[j - 0]] +
Lx[j - 1] * x[Li[j - 1]];
}
x[i] = xi_temp;
}
With that I measure:
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
benchFillRhs 2728 ns 2729 ns 5090909
benchBacksolveBaseline 17451 ns 17449 ns 822018
benchBacksolveOptimized 13440 ns 13443 ns 1018182
All of these versions still use the same simple implementation of the backward solve on the sparse matrix structure. Inherently, operating on sparse matrix structures like these can have significant problems with memory traffic. At least for matrix factorizations, there are more sophisticated methods, that operate on dense submatrices that are assembled from the sparse structure. Examples are supernodal and multifrontal methods. I am a bit fuzzy on this, but I think that such methods will also apply this idea to layout and use dense matrix operations for lower triangular backwards solves (for example for Cholesky-type factorizations). So it might be worth to look into those kind of methods, if you are not forced to stick to the simple method that works on the sparse structure directly. See for example this survey by Davis.
You might shave a few cycles by using unsigned
instead of int
for the index types, which must be >= 0
anyway:
void backsolve(const unsigned * __restrict__ Lp,
const unsigned * __restrict__ Li,
const double * __restrict__ Lx,
const unsigned n,
double * __restrict__ x) {
for (unsigned i = n; i-- > 0; ) {
for (unsigned j = Lp[i]; j < Lp[i + 1]; ++j) {
x[i] -= Lx[j] * x[Li[j]];
}
}
}
Compiling with Godbolt's compiler explorer shows slightly different code for the innerloop, potentially making better use of the CPU pipeline. I cannot test, but you could try.
Here is the generated code for the inner loop:
.L8:
mov rax, rcx
.L5:
mov ecx, DWORD PTR [r10+rax*4]
vmovsd xmm1, QWORD PTR [r11+rax*8]
vfnmadd231sd xmm0, xmm1, QWORD PTR [r8+rcx*8]
lea rcx, [rax+1]
vmovsd QWORD PTR [r9], xmm0
cmp rdi, rax
jne .L8
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With