
AVX2 sparse matrix multiplication

I'm trying to leverage the new AVX2 GATHER instructions to speed up a sparse matrix - vector multiplication. The matrix is in CSR (or Yale) format, with a row-pointer array that indexes into a column-index array holding the column of each non-zero value. The C code for such a mat-vec mul looks like this:

for (int row = 0; row < n_rows - 1; row++) {
    double rowsum = 0;
    for (int col = row_ptr[row]; col < row_ptr[row + 1]; col++) {
        rowsum += values[col] * x[col_indices[col]];
    }
    result[row] = rowsum;
}
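
For reference, here is what the three CSR arrays look like for a small hypothetical 3x4 matrix; the array names match the loop above, and row_ptr has one more entry than there are matrix rows, which appears to be why the loop runs to n_rows - 1:

/* Hypothetical matrix (dots are zeros):
 *   [ 5 . 2 . ]
 *   [ . 3 . . ]
 *   [ . . 1 4 ]
 */
double values[]      = { 5, 2, 3, 1, 4 };   // non-zero values, row by row
int    col_indices[] = { 0, 2, 1, 2, 3 };   // column of each non-zero value
int    row_ptr[]     = { 0, 2, 3, 5 };      // row i spans values[row_ptr[i]] .. values[row_ptr[i+1] - 1]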

Now my goal is to accelerate this with AVX2 intrinsics. The following code works with the latest Intel compiler or GCC, and is based on https://blog.fox-toolkit.org/?p=174 . I removed the remainder handling here because my rows all align on 4 doubles (columns % 4 == 0) anyway (lucky me). I have code dealing with the remainder too if someone is interested (a minimal sketch of one way to do it follows the loop below), but the point is: the code is actually slightly slower. I checked the disassembly, and for the scalar version above only scalar FP instructions are generated, while for my AVX2 code all the expected AVX2 ops appear. Even with small matrices that fit into the cache, the AVX2 version is no good. I'm puzzled here...

double* value_base = &values[0];
double* x_base = &x[0];
int*    index_base = &col_indices[0];


for (int row = 0; row < n_rows - 1; row++) {
    int col_length   = row_ptr[row + 1] - row_ptr[row];

    __m256d rowsum = _mm256_set1_pd(0.);
    for (int col4 = 0; col4 < col_length; col4 += 4) {
        // Load 4 column indices used to gather from x
        __m128i idxreg     = _mm_load_si128((const __m128i*)index_base);
        // Load 4 doubles from x indexed by idxreg (AVX2)
        __m256d x_     = _mm256_i32gather_pd(x_base, idxreg, 8);
        // Load 4 contiguous doubles from the value array
        __m256d v_     = _mm256_load_pd(value_base);
        // FMA: rowsum += x_ * v_
        rowsum = _mm256_fmadd_pd(x_, v_, rowsum);

        index_base += 4;
        value_base += 4;
    }
    __m256d s = _mm256_hadd_pd(rowsum, rowsum);
    result[row] = ((double*)&s)[0] + ((double*)&s)[2];
    // Alternative (not faster):
    // Now we split the upper and lower AVX register, and do a number of horizontal adds
    //__m256d hsum = _mm256_add_pd(rowsum, _mm256_permute2f128_pd(rowsum, rowsum, 0x1));
    //_mm_store_sd(&result[row], _mm_hadd_pd( _mm256_castpd256_pd128(hsum), _mm256_castpd256_pd128(hsum) ) );
}
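
For completeness, a minimal sketch of how the scalar remainder could be handled for rows whose length is not a multiple of 4 (not my exact code): the vector loop bound changes to col4 + 4 <= col_length, and after the horizontal sum a scalar tail finishes the row, inside the row loop:

    // Scalar tail for the last (col_length % 4) elements of the row (sketch only)
    for (int col = row_ptr[row] + (col_length & ~3); col < row_ptr[row + 1]; col++) {
        result[row] += values[col] * x[col_indices[col]];
        index_base++;   // keep the running pointers in step with the tail
        value_base++;
    }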

Any suggestions are welcome.

Thanks a lot, Chris

asked Jul 15 '15 by chhu79


1 Answer

Gather on Haswell is slow. I implemented an 8-bit-index LUT lookup of 16-bit values (for GF16 multiply for par2) in a few different ways to find out what's fastest. On Haswell, the VPGATHERDD version took 1.7x as long as the movd / pinsrw version. (Only a couple of VPUNPCK / shift instructions were needed beyond the gathers.) The code is available if anyone wants to run the benchmark.

As is common when an instruction is first introduced, Intel doesn't dedicate a huge amount of silicon to making it super-fast. It's there just to get HW support out there, so code can be written to use it. For ideal performance on all CPUs, you need to do what x264 did for pshufb: have a SLOW_SHUFFLE flag for CPUs like Core2, and factor that into your best-routine-finding function-pointer setup, rather than looking only at which insns a CPU supports.
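
Something along these lines, as a sketch (not x264's actual code; the kernel names and the gather_is_fast flag are made up for illustration, and the feature check uses GCC's __builtin_cpu_supports):

// Two implementations of the same kernel; pick one at startup and call
// through the pointer in hot code.  (Names are made up for this sketch.)
void spmv_gather(const double *val, const int *col_idx, const int *row_ptr,
                 const double *x, double *result, int n_rows);
void spmv_scalar(const double *val, const int *col_idx, const int *row_ptr,
                 const double *x, double *result, int n_rows);

static void (*spmv_best)(const double *, const int *, const int *,
                         const double *, double *, int);

static void pick_spmv(int gather_is_fast)   // e.g. set from a CPU family/model table
{
    if (__builtin_cpu_supports("avx2") && gather_is_fast)
        spmv_best = spmv_gather;
    else
        spmv_best = spmv_scalar;   // also the right pick on Haswell, where gather loses
}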

For projects less fanatical about having asm versions tuned for every CPU they can run on, introducing a no-speedup version of an instruction will get people using it sooner, so when the next design comes along and it's fast, more code speeds up. Releasing a design like Haswell where gather is actually a slowdown is a little dicey. Maybe they wanted to see how people would use it? It does increase code density, which helps when the gather isn't in a tight loop.

Broadwell is supposed to have a faster gather implementation, but I don't have access to one. The Intel manual that lists latency/throughput for instructions says Broadwell's gather is about 1.6x faster, so it would still be slightly slower than a hand-crafted loop that shifts/unpacks the indices in GP regs, and uses them for PINSRW into vectors.
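
For the double case in the question, the hand-rolled equivalent would look something like this (a sketch, not benchmarked here): pull the indices into GP registers and do four scalar loads instead of vgatherdpd.

#include <immintrin.h>

// Manual 4-element gather of doubles, as an alternative to
// _mm256_i32gather_pd on CPUs where hardware gather is slow (sketch).
static inline __m256d gather4_pd_manual(const double *base, const int *idx)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)idx);
    int i0 = _mm_cvtsi128_si32(v);        // element 0
    int i1 = _mm_extract_epi32(v, 1);     // SSE4.1 extracts for the rest
    int i2 = _mm_extract_epi32(v, 2);
    int i3 = _mm_extract_epi32(v, 3);
    // _mm256_set_pd takes the highest lane first.
    return _mm256_set_pd(base[i3], base[i2], base[i1], base[i0]);
}

Dropping that in for the _mm256_i32gather_pd line in the question's inner loop would be one way to compare the two approaches on a given CPU.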

If gather could take advantage of cases where multiple elements had the same index, or even an index that pointed to the same 32B fetch block, there could be some big speedups depending on the input data.

Hopefully Skylake will improve even further. I thought I'd read something saying that it would, but on checking, I didn't find anything.

RE: sparse matrices: isn't there a format that duplicates the data, so you can do contiguous reads for rows or columns? It's not something I've had to write code for, but I think I've seen mention of it in some answers.

answered Oct 21 '22 by Peter Cordes