Unrolled x86_64 Assembly Comba for Ch. 11 FFA

May 12th, 2019

This post continues the work of the previous post, this time unrolling the code for additional performance.

On the Ada side, I have added the unrolled code invocation as a fastpath for a specific size in FZ_Multiply_Unbuffered: this way, no additional restrictions are placed on the library users [1]. I have also inserted a safety check for the FZ size:

   -- Comba's multiplier fastpath. (CAUTION: UNBUFFERED)
   procedure FZ_Mul_Comba_Fast(X     : in  FZ;
                               Y     : in  FZ;
                               XY    : out FZ)
   is
      procedure Asm_Comba(X      : in  FZ;
                          Y      : in  FZ;
                          XY     : out FZ;
                          L      : in  Word_Index);
      pragma Import (C, Asm_Comba, "x86_64_comba_unrolled");
   begin
      pragma Assert(X'Length = Karatsuba_Thresh and
                      Y'Length = Karatsuba_Thresh and
                      XY'Length = 2*Karatsuba_Thresh);
      Asm_Comba(X, Y, XY, X'Length);
   end FZ_Mul_Comba_Fast;

   -- ...

   -- Multiplier. (CAUTION: UNBUFFERED)
   procedure FZ_Multiply_Unbuffered(X     : in  FZ;
                                    Y     : in  FZ;
                                    XY    : out FZ) is

      -- The length of either multiplicand
      L : constant Word_Count := X'Length;

   begin

      if L = Karatsuba_Thresh then

         -- Optimized case:
         FZ_Mul_Comba_Fast(X, Y, XY);
      elsif L < Karatsuba_Thresh then

         -- Base case:
         FZ_Mul_Comba(X, Y, XY);
      else

         -- Recursive case:
         Mul_Karatsuba(X, Y, XY);

      end if;

   end FZ_Multiply_Unbuffered;

... and added a comment near Karatsuba_Thresh for keeping the Ada and assembly constants in sync:

   -- Karatsuba Threshold - at or below this many Words, we use Comba mult.
   -- Edit the Karatsuba_Thresh in x86_64_comba.s as well after changing this
   -- value.
   Karatsuba_Thresh : constant Indices := 32;
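The size dispatch above can be modeled in a few lines; this is a minimal Python sketch (the function and argument names are illustrative, not part of the library):

```python
KARATSUBA_THRESH = 32  # must stay in sync with the assembly-side constant

def multiply_dispatch(l, fastpath, comba, karatsuba):
    """Mirror of FZ_Multiply_Unbuffered's size dispatch: exactly at the
    threshold, take the unrolled fastpath; below it, the looped Comba;
    above it, recurse via Karatsuba. The three arguments stand in for
    the corresponding multiplier routines."""
    if l == KARATSUBA_THRESH:
        return fastpath
    elif l < KARATSUBA_THRESH:
        return comba
    else:
        return karatsuba
```

Because the fastpath fires only at exactly the threshold size, Karatsuba's recursion bottoms out in it whenever the operand width is a power-of-two multiple of the threshold.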

On the assembly side, I have used GNU assembler macros for the implementation: while this made the structure a bit more convoluted, macros make it easy to change the unrolling factor, which I will use in the evaluation. So, let's have a look at the macros.

The first set of macros consists of gen_col_inner, col_finish, and gen_col. gen_col_inner generates the necessary number of iterations of the inner loop of the Col function. The code is reused from the previous version, and the register allocation is the same. The arguments to the macro control the number of iterations that must be generated [2]. col_finish writes the final product computed in the A0 accumulator to the result FZ, shifts the accumulators, and increments the 'current column' register. Here comes a small change in register allocation: the RCX register is now used for the XY FZ instead of R13. Previously, it stored the nominal FZ length/upper loop bounds, which are now also known at compile time. gen_col just provides a convenient wrapper over the two previous macros.

.macro gen_col_inner I NIter
.if \NIter - \I
gen_col_inner "(\I + 1)" \NIter
.endif
lea rdx, [rsi + 8*rbx]   # rdx := Y'Address + N*8
lea rax, [8*r11]         # rax := 8*j
sub rdx, rax             # rdx := rdx - j*8
mov rdx, [rdx]           # rdx := *(rdx)
mov rax, [rdi + 8*r11]   # rax := X(j) := *(X'Address + j*8)
mul rdx                  # rdx:rax := rax*rdx
add r8,  rax             # A0, C := A0 + rax
adc r9,  rdx             # A1, C := A1 + rdx + C
adc r10, 0               # A2, [C=0] := A2 + 0 + C
inc r11                  # J := J + 1
.endm

.macro col_finish
mov [rcx + 8*rbx], r8    # XY(N) := A0
mov r8, r9               # A0 := A1
mov r9, r10              # A1 := A2
xor r10, r10             # A2 := 0
inc rbx                  # N  := N + 1
.endm

.macro gen_col NIter
gen_col_inner 0 \NIter
col_finish
.endm
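
The accumulator arithmetic that gen_col and col_finish expand into can be modeled word by word; here is a minimal Python sketch (64-bit words emulated with masking; the function name and calling convention are illustrative):

```python
M = (1 << 64) - 1  # 64-bit word mask

def comba_column(x, y, xy, n, u, v, acc):
    """Model of one gen_col + col_finish expansion: accumulate the
    partial products X[j]*Y[n-j] for j = u..v into the 3-word
    accumulator (A0, A1, A2) = (r8, r9, r10), then store the low
    word into XY[n] and shift the accumulator."""
    a0, a1, a2 = acc
    for j in range(u, v + 1):
        p = x[j] * y[n - j]           # mul: rdx:rax := X(j) * Y(N-j)
        lo, hi = p & M, p >> 64
        t = a0 + lo                   # add r8, rax
        a0 = t & M
        t = a1 + hi + (t >> 64)       # adc r9, rdx  (carry from the add)
        a1 = t & M
        a2 = (a2 + (t >> 64)) & M     # adc r10, 0   (carry into A2)
    xy[n] = a0                        # col_finish: XY(N) := A0
    return n + 1, (a1, a2, 0)         # shift accumulator, N := N + 1
```

The three-word accumulator never overflows for the operand sizes in question, since a column of L products of 64-bit words fits in 128 + log2(L) bits.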

The next set of macros comprises gen_loop_low, gen_loop_high_inner, and gen_loop_high. These macros prepare the arguments for the col function and generate the unrolled equivalent of this function using the gen_col macro. While gen_loop_low is trivial, gen_loop_high is a bit more involved: the U argument of the col function ranges over 1..L-1 inside the loop, while V is fixed at L-1; the number of col iterations that must be generated inside the Ith iteration of gen_loop_high is thus V-U = V - I = L - 1 - I.

Both the low and the high loops must provide the run-time values of N and U to the col function bodies. N increases uniformly across the 2*L columns and is handled in col_finish; the U value for the low-part loop is produced by zeroing the R11 register; for the high loop, I have used an additional register, R12, to calculate the value passed to the col code in the R11 register.

.macro gen_loop_low L
.if \L
gen_loop_low "(\L-1)"
xor r11, r11            # U := 0
gen_col \L-1
.endif
.endm

.macro gen_loop_high_inner I L
.if \L-\I
inc r12                 # I := I + 1
mov r11, r12            # U := I (U in col)
gen_col "(\L-1-\I)"
gen_loop_high_inner "(\I+1)" \L
.endif
.endm

.macro gen_loop_high L
gen_loop_high_inner 1 \L
.endm
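
Putting the low and high loops together, the full column schedule that these macros unroll can be sketched in Python as follows (64-bit words emulated with masking; this models the generated code's structure, not the assembly itself):

```python
M = (1 << 64) - 1  # 64-bit word mask

def comba_multiply(x, y, l):
    """Model of the unrolled Comba schedule: the low loop runs columns
    N = 0..L-1 with U fixed to 0 (the xor r11, r11), the high loop runs
    columns N = L..2L-2 with U = I counted in R12, and a final
    col_finish writes the last word from the accumulator."""
    xy = [0] * (2 * l)
    a0 = a1 = a2 = 0

    def column(n, u, v):
        nonlocal a0, a1, a2
        for j in range(u, v + 1):
            p = x[j] * y[n - j]          # mul
            lo, hi = p & M, p >> 64
            t = a0 + lo                  # add r8, rax
            a0 = t & M
            t = a1 + hi + (t >> 64)      # adc r9, rdx
            a1, a2 = t & M, (a2 + (t >> 64)) & M  # adc r10, 0
        xy[n] = a0                       # col_finish
        a0, a1, a2 = a1, a2, 0

    for n in range(l):                   # gen_loop_low: U = 0, V = N
        column(n, 0, n)
    for i in range(1, l):                # gen_loop_high: U = I, V = L-1
        column(l - 1 + i, i, l - 1)
    xy[2 * l - 1] = a0                   # final col_finish
    return xy
```

Each column n sums exactly the products x[j]*y[n-j] with both indices in 0..L-1, so concatenating the output words reproduces the full 2L-word product.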

These macros are used by the x86_64_comba_unrolled function. The function uses a separate assembly-level definition of Karatsuba_Thresh, which is matched against the passed FZ size at function entry [3]. After the check, the function clears the registers and generates the loops that calculate the low and high parts of the result. A final round of col_finish is necessary to write the last word of the result from the accumulator to memory.

# Arguments
# RDI: X
# RSI: Y
# RDX: XY
# RCX: L
.global x86_64_comba_unrolled
x86_64_comba_unrolled:
push rbx
push r12

cmp rcx, Karatsuba_Thresh
jne size_fail

mov rcx, rdx   # RCX := XY
xor r12, r12   # TMP := 0
xor r8,  r8    # A0  := 0
xor r9,  r9    # A1  := 0
xor r10, r10   # A2  := 0
xor rbx, rbx   # N   := 0

gen_loop_low Karatsuba_Thresh
gen_loop_high Karatsuba_Thresh
col_finish

pop r12
pop rbx
ret

size_fail:
ud2

Evaluating this code is a bit tricky: for the Chapter 9 code, I used the signature verification benchmark. However, after the introduction of Karatsuba multiplication, the total percentage of cycles spent in multiplication dropped to approximately 1%, rendering this tape useless as a benchmark. Therefore, I have prepared a trivial microbenchmark for multiplication. I will draw the final conclusion about which unrolled variant is really worth using after I get to Chapter 14.
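
The shape of such a microbenchmark is simple; a minimal Python sketch (the actual benchmark drives the Ada/assembly code, so the harness below is illustrative only):

```python
import random
import time

def bench(mul, l, iters=1000):
    """Time `iters` multiplications of two random L-word operands and
    return the total elapsed seconds. `mul(x, y, l)` is any multiplier
    under test; fixed random operands keep runs comparable."""
    random.seed(l)  # deterministic operands per size
    x = [random.getrandbits(64) for _ in range(l)]
    y = [random.getrandbits(64) for _ in range(l)]
    t0 = time.perf_counter()
    for _ in range(iters):
        mul(x, y, l)
    return time.perf_counter() - t0
```

A constant-time multiplier's runtime does not depend on the operand values, so a single random operand pair per size suffices.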

The evaluation results are presented in the table below:

   Variant / Bitness        Runtime (s)
                          2048     4096     8192
   Orig. Ch.11            0.621    1.919    5.894
   Asm-Loops (8 thres.)   0.211    0.702    2.261
   Asm-8                  0.201    0.673    2.168
   Asm-Loops (16 thres.)  0.151    0.526    1.737
   Asm-16                 0.145    0.506    1.657
   Asm-Loops (32 thres.)  0.141    0.497    1.638
   Asm-32                 0.128    0.461    1.551

where Asm-Loops means the assembly code with loops from my previous vpatch adapted to Chapter 11 (with different Karatsuba threshold points: 8, 16, 32), and Asm-X means the unrolled Comba assembly for Karatsuba threshold X. One can see that the unrolled Comba provides benefits at all threshold values. It may be beneficial to unroll the code even further; however, this will require additional experiments: with Karatsuba_Thresh set to 32 and an FZ width of 2048 bits, multiplications are carried out entirely by the assembly Comba; maybe it would be better to translate the Karatsuba algorithm to assembly instead?

My seals for FFA Chapters 10 and 11:

curl 'http://bvt-trace.net/vpatches/ffa_ch10_karatsuba.kv.vpatch.bvt.sig' > ffa_ch10_karatsuba.kv.vpatch.bvt.sig
curl 'http://bvt-trace.net/vpatches/ffa_ch11_tuning_and_api.kv.vpatch.bvt.sig' > ffa_ch11_tuning_and_api.kv.vpatch.bvt.sig

The vpatch and the seal for the current chapter are available via the links below. The vpatch requires all previous FFA vpatches, up to Chapter 11.

curl 'http://bvt-trace.net/vpatches/ch11_unrolled_asm_comba.vpatch' > ch11_unrolled_asm_comba.vpatch
curl 'http://bvt-trace.net/vpatches/ch11_unrolled_asm_comba.vpatch.bvt.sig' > ch11_unrolled_asm_comba.vpatch.bvt.sig
  1. Otherwise, I'd have to increase the minimum ffacalc bitness to 512 for Karatsuba_Thresh 8, to 1024 for 16, and to 2048 for 32. TBH, I don't quite like the fastpath variant - having two versions of the same code in a codebase definitely has some smell. I will decide whether it is worth just increasing the minimal bitness when I get to Chapter 14.
  2. In all cases, this number is known at compile time.
  3. This check has low cost, but the potential to prevent severe consequences in case of developer neglect.