Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Speed up large modular multiplication in base 2^8 without multiplier

I am currenty converting the nacl library to risc-v. I already have poly1305 working. I am trying to do this using the risc-v core instruction set, so I don't have a multiplier. The algorithm for Pol1305 is using at the moment ceil(m/16)*17*17 8-bit multiplications where m is the message length in bytes(multiplying two 2^130 integers in base 2^8 modulo 2^130-5). So I want to use a fast multiplication algorithm to keep it fast.

At the moment I have the shift-and-add algorithm working for the multiplication. However that takes 63 cycles for 8-bit values, since I need to avoid branching (timing side-channel) and thus some masking is involved which takes a few more cycles.

    andi  t2, t0, 1     //t0 is the multiplier
    sub   t2, zero, t2  //creating a mask
    and   t3, t1, t2    //applying the mask to the multiplicand
    add   a0, a0, t3    //doing the add
    srli  t0, t0, 1     //shifting the multiplier
    slli  t1, t1, 1     //shifting the multiplicand

This is giving me valid results at 63 cycles per multiplication. The issue is that the total execution time of the program is 175219 cycles for a message of 131 bytes. Of this time 9*17*17*63 = 163863 cycles are used for multiplication. Which I would like to improve.

like image 713
Stefan vd Berg Avatar asked Nov 01 '19 14:11

Stefan vd Berg


1 Answers

Here is a little improved code. This is based on the algorithm shown in Patterson and Hennessy's textbook.

    // initialization
    add   a0, zero, t0  //copy multiplier to the product
    slli  t1, t1, 8     //shift the multiplicand to the left half of a 16bit register

    // repeat 8 times from here
    andi  t2, a0, 1     //the right half of a0 is the multiplier
    sub   t2, zero, t2  //creating a mask
    and   t3, t1, t2    //applying the mask to the multiplicand
    add   a0, a0, t3    //doing the add to the left half of the product
    srli  a0, a0, 1     //shifting the product
    // to here

Also, you can apply the loop-unfolding method by repeating the above code 8 times instead of looping by branch/jump.

On the algorithm level, Karatsuba algorithm can reduce the number of 8bit multiplications in multi-precision arithmetic.

like image 174
Hiroto Kagotani Avatar answered Sep 30 '22 01:09

Hiroto Kagotani