
Vectorizing Modular Arithmetic

I'm trying to write some reasonably fast component-wise vector addition code. I'm working with (signed, I believe) 64-bit integers.

The function is

void addRq (int64_t* a, const int64_t* b, const int32_t dim, const int64_t q) {
    for(int i = 0; i < dim; i++) {
        a[i] = (a[i]+b[i])%q; // LINE1
    }
}

I'm compiling with icc -std=gnu99 -O3 (icc so I can use SVML later) on an IvyBridge (SSE4.2 and AVX, but not AVX2).

My baseline is removing the %q from LINE1. 100 (iterated) function calls with dim=11221184 take 1.6 seconds. ICC auto-vectorizes the code for SSE; great.

I really want to do modular additions though. With the %q, ICC does not auto-vectorize the code, and it runs in 11.8 seconds(!). Even ignoring the auto-vectorization the previous version got, this still seems excessive.

Since I don't have AVX2, vectorization with SSE requires SVML, which is perhaps why ICC didn't auto-vectorize. At any rate, here's my attempt to vectorize the inner loop:

__m128i qs = _mm_set1_epi64x(q);
for(int i = 0; i < dim; i+=2) {
    __m128i xs = _mm_load_si128((const __m128i*)(a+i));
    __m128i ys = _mm_load_si128((const __m128i*)(b+i));
    __m128i zs = _mm_add_epi64(xs,ys);
    zs = _mm_rem_epi64(zs,qs);
    _mm_store_si128((__m128i*)(a+i),zs);
}

Assembly for the main loop is:

..B3.4:                         # Preds ..B3.2 ..B3.12
    movdqa    (%r12,%r15,8), %xmm0                          #59.22
    movdqa    %xmm8, %xmm1                                  #60.14
    paddq     (%r14,%r15,8), %xmm0                          #59.22
    call      __svml_i64rem2                                #61.9
    movdqa    %xmm0, (%r12,%r15,8)                          #61.36
    addq      $2, %r15                                      #56.30
    cmpq      %r13, %r15                                    #56.24
    jl        ..B3.4        # Prob 82%                      #56.24

So the code is getting vectorized as expected. I know I might not get a 2x speedup due to SVML, but the code runs in 12.5 seconds, slower than with no vectorization at all! Is this really the best that can be done here?

asked Dec 16 '13 by crockeea


1 Answer

Neither SSE2 nor AVX2 has integer division instructions. Intel is being disingenuous in calling the SVML functions intrinsics, since many of them are complicated functions that map to many instructions, not just one or two.

There is a way to do faster division (and modulo) with SSE2 or AVX2. See the paper Improved division by invariant integers. Basically, you precompute a fixed-point reciprocal of the divisor once and then replace each division with a multiplication (plus shifts and corrections). The precomputation takes time, but since q is the same for the whole loop it only has to be done once, so for a large enough dim it should win out. I described this method in more detail in SSE integer division?, and I also implemented it successfully in a prime number finder: Finding lists of prime numbers with SIMD - SSE/AVX.
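
To give a rough feel for the precompute-then-multiply idea, here is a minimal scalar sketch. This is my own illustration, not code from the paper or from the linked answers, and it assumes q >= 2, unsigned values, and a compiler with the unsigned __int128 extension (GCC and ICC have it). With m = floor(2^64/q) computed once, the estimated quotient (n*m) >> 64 is either exact or one too small, so a single conditional subtraction fixes the remainder:

#include <stdint.h>

/* Precompute once per divisor: m = floor(2^64 / q).  Requires q >= 2. */
static uint64_t reciprocal_u64(uint64_t q) {
    return (uint64_t)(((unsigned __int128)1 << 64) / q);
}

/* n % q using the precomputed reciprocal m.  qhat is floor(n/q) or
   floor(n/q) - 1, so at most one conditional subtraction is needed. */
static uint64_t mod_u64(uint64_t n, uint64_t q, uint64_t m) {
    uint64_t qhat = (uint64_t)(((unsigned __int128)n * m) >> 64);  /* ~ n / q */
    uint64_t r = n - qhat * q;
    if (r >= q) r -= q;
    return r;
}

A SIMD version does the same thing, but emulating a 64-bit multiply-high out of _mm_mul_epu32 pieces takes several instructions, which is why the precomputation only pays off when the divisor is reused many times, as it is here.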

Agner Fog implements 32-bit (but not 64-bit) division in his Vector Class using the method described in that paper. That would be a good place to start if you want some code, but you will have to extend it to 64 bits.

Edit: Based on Mysticial's comments, and assuming that the inputs are already reduced modulo q, I produced an SSE version. Since a[i] and b[i] are both in [0, q), the sum is in [0, 2q), so a single compare and conditional subtraction of q brings it back into range; no division is needed at all. If this is compiled with MSVC it needs to be in 64-bit mode, as 32-bit mode does not support _mm_set1_epi64x. That can be fixed for 32-bit mode, but I don't want to do it.

#ifdef _MSC_VER 
#include <intrin.h>
#endif
#include <nmmintrin.h>                 // SSE4.2
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>                  // PRId64 for printing int64_t

void addRq_SSE(int64_t* a, const int64_t* b, const int32_t dim, const int64_t q) {
    __m128i q2 = _mm_set1_epi64x(q);                      // broadcast q into both lanes
    __m128i t2 = _mm_sub_epi64(q2,_mm_set1_epi64x(1));    // q - 1, threshold for the compare
    for(int i = 0; i < dim; i+=2) {
        __m128i a2 = _mm_loadu_si128((__m128i*)&a[i]);
        __m128i b2 = _mm_loadu_si128((__m128i*)&b[i]);
        __m128i c2 = _mm_add_epi64(a2,b2);
        __m128i cmp = _mm_cmpgt_epi64(c2, t2);            // lanes where a[i]+b[i] >= q become all ones
        c2 = _mm_sub_epi64(c2, _mm_and_si128(q2,cmp));    // subtract q only in those lanes
        _mm_storeu_si128((__m128i*)&a[i], c2);
    }
}

int main() {
    const int64_t dim = 20;
    int64_t a[dim];
    int64_t b[dim];
    int64_t q = 10;

    for(int i=0; i<dim; i++) {
        a[i] = i%q; b[i] = i%q;
    }
    addRq_SSE(a, b, dim, q);
    for(int i=0; i<dim; i++) {
        printf("%" PRId64 "\n", a[i]);
    }   
}
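
One caveat worth noting (my addition, not part of the answer above): addRq_SSE handles two elements per iteration, so it assumes dim is even, and it relies on both inputs already being in [0, q). If dim can be odd, a scalar tail using the same compare-and-subtract reduction covers the last element, for example:

/* Scalar tail for the final element when dim is odd; same
   reduced-input assumption as addRq_SSE. */
void addRq_tail(int64_t* a, const int64_t* b, const int32_t dim, const int64_t q) {
    for (int i = dim & ~1; i < dim; i++) {
        int64_t c = a[i] + b[i];
        a[i] = (c >= q) ? c - q : c;
    }
}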
answered Oct 10 '22 by Z boson