This post continues the work of the previous post, this time unrolling the code for additional performance.
On the Ada side, I have added the unrolled code invocation as a fast path for one specific size in FZ_Multiply_Unbuffered: this way, no additional restrictions are placed on the library's users [1]. I have also inserted a safety check for the FZ size:
-- Comba's multiplier fastpath. (CAUTION: UNBUFFERED)
procedure FZ_Mul_Comba_Fast(X : in FZ; Y : in FZ; XY : out FZ) is
   procedure Asm_Comba(X : in FZ; Y : in FZ; XY : out FZ; L : in Word_Index);
   pragma Import (C, Asm_Comba, "x86_64_comba_unrolled");
begin
   pragma Assert(X'Length  = Karatsuba_Thresh and
                 Y'Length  = Karatsuba_Thresh and
                 XY'Length = 2*Karatsuba_Thresh);
   Asm_Comba(X, Y, XY, X'Length);
end FZ_Mul_Comba_Fast;

-- ...

-- Multiplier. (CAUTION: UNBUFFERED)
procedure FZ_Multiply_Unbuffered(X : in FZ; Y : in FZ; XY : out FZ) is
   -- The length of either multiplicand
   L : constant Word_Count := X'Length;
begin
   if L = Karatsuba_Thresh then
      -- Optimized case:
      FZ_Mul_Comba_Fast(X, Y, XY);
   elsif L < Karatsuba_Thresh then
      -- Base case
      FZ_Mul_Comba(X, Y, XY);
   else
      -- Recursive case:
      Mul_Karatsuba(X, Y, XY);
   end if;
end FZ_Multiply_Unbuffered;
... and added a comment near Karatsuba_Thresh for keeping the Ada and assembly constants in sync:
-- Karatsuba Threshold - at or below this many Words, we use Comba mult.
-- Edit the Karatsuba_Thresh in x86_64_comba.s as well after changing this
-- value.
Karatsuba_Thresh : constant Indices := 32;
On the assembly side, I have used GNU assembler macros for the implementation: while this makes the structure a bit more convoluted, macros make it easy to change the unrolling factor, which I will use in the evaluation. So, let's have a look at the macros.
The first set of macros is gen_col_inner, col_finish, and gen_col. gen_col_inner generates the required number of inner iterations of the Col function. The code is reused from the previous version; the register allocation is the same. The arguments to the macro control the number of iterations that must be generated [2]. col_finish writes the final product computed in the A0 accumulator to the result FZ, shifts the accumulators, and increments the 'current column' register. Here comes a small change in register allocation: the RCX register is now used for the XY FZ instead of R13. Previously it stored the nominal FZ length/upper loop bounds, which are now also known at compile-time. gen_col just provides a convenient wrapper over the two previous macros.
.macro gen_col_inner I NIter
.if \NIter - \I
gen_col_inner "(\I + 1)" \NIter
.endif
lea rdx, [rsi + 8*rbx]     # rdx := Y'Address + N*8
lea rax, [8*r11]           # rax := 8*j
sub rdx, rax               # rdx := rdx - j*8
mov rdx, [rdx]             # rdx := *(rdx)
mov rax, [rdi + 8*r11]     # rax := X(j) := *(X'Address + j*8)
mul rdx                    # rdx:rax := rax*rdx
add r8, rax                # A0, C := A0 + rax
adc r9, rdx                # A1, C := A1 + rdx + C
adc r10, 0                 # A2, [C=0] := A2 + 0 + C
inc r11                    # J := J + 1
.endm

.macro col_finish
mov [rcx + 8*rbx], r8      # XY(N) := A0
mov r8, r9                 # A0 := A1
mov r9, r10                # A1 := A2
xor r10, r10               # A2 := 0
inc rbx                    # N := N + 1
.endm

.macro gen_col NIter
gen_col_inner 0 \NIter
col_finish
.endm
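For readers who prefer C, here is a sketch of my own (not the project's code) modeling what one expanded gen_col_inner body computes: a 64x64-to-128-bit product of X(j) and Y(N-j), accumulated into the three-word accumulator A0:A1:A2 with full carry propagation.

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative C model (not FFA code) of one expanded gen_col_inner body:
   accumulate x[j] * y[n - j] into the three-word accumulator a0:a1:a2,
   then advance j -- mirroring the mul/add/adc/adc/inc sequence above. */
typedef struct { uint64_t a0, a1, a2; } acc_t;

static void col_step(acc_t *acc, const uint64_t *x, const uint64_t *y,
                     uint64_t n, uint64_t *j)
{
    unsigned __int128 p = (unsigned __int128)x[*j] * y[n - *j]; /* mul rdx */
    uint64_t lo = (uint64_t)p, hi = (uint64_t)(p >> 64);
    uint64_t c0 = ((acc->a0 += lo) < lo);   /* add r8, rax              */
    uint64_t c1 = ((acc->a1 += hi) < hi);   /* adc r9, rdx (sum part)   */
    c1 += ((acc->a1 += c0) < c0);           /* adc r9, rdx (carry part) */
    acc->a2 += c1;                          /* adc r10, 0               */
    (*j)++;                                 /* inc r11                  */
}
```

The three-word accumulator never overflows here, because even the worst case, (2^64 - 1)^2 accumulated repeatedly, leaves headroom in the top word for any realistic column length.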
The next set of macros is gen_loop_low, gen_loop_high_inner, and gen_loop_high. These macros prepare the arguments for the col function and generate the unrolled equivalent of that function using the gen_col macro. While gen_loop_low is trivial, gen_loop_high is a bit more involved: the U argument of the col function ranges over 1..L-1 inside the loop, while V is fixed at L-1; the number of col iterations that must be generated inside the Ith iteration of gen_loop_high is thus V - U = V - I = L - 1 - I.
Both the low and the high loop must provide the necessary run-time values of N and U to the col function bodies. N changes uniformly from 1 to 2*L and is handled in col_finish; the U value for the low-part loop is produced by zeroing the R11 register, while for the high loop I have used an additional register, R12, to compute the value that is passed to the col code in R11.
.macro gen_loop_low L
.if \L
gen_loop_low "(\L-1)"
xor r11, r11               # U := 0
gen_col \L-1
.endif
.endm

.macro gen_loop_high_inner I L
.if \L-\I
inc r12                    # I := I + 1
mov r11, r12               # U := I (U in col)
gen_col "(\L-1-\I)"
gen_loop_high_inner "(\I+1)" \L
.endif
.endm

.macro gen_loop_high L
gen_loop_high_inner 1 \L
.endm
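Putting the pieces together: what these macros unroll is the classic column-wise (Comba) multiplication. The following compact C model is again my own sketch, not the FFA code; the low columns are those with n < l (U fixed at 0), the high ones those with n >= l (U = n - l + 1, V fixed at l - 1).

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* C model (not FFA code) of the Comba multiplication the macros unroll:
   for each result column n, sum the products x[j]*y[n-j] into a
   three-word accumulator, then retire one result word per column
   (the col_finish step), shifting the accumulator down. */
static void comba_mul(const uint64_t *x, const uint64_t *y,
                      uint64_t *xy, size_t l)
{
    uint64_t a0 = 0, a1 = 0, a2 = 0;
    for (size_t n = 0; n < 2*l - 1; n++) {
        size_t u = (n < l) ? 0 : n - l + 1;  /* first j of this column (U) */
        size_t v = (n < l) ? n : l - 1;      /* last j of this column  (V) */
        for (size_t j = u; j <= v; j++) {
            unsigned __int128 p = (unsigned __int128)x[j] * y[n - j];
            uint64_t lo = (uint64_t)p, hi = (uint64_t)(p >> 64);
            uint64_t c0 = ((a0 += lo) < lo);
            uint64_t c1 = ((a1 += hi) < hi);
            c1 += ((a1 += c0) < c0);
            a2 += c1;
        }
        xy[n] = a0;                          /* col_finish:            */
        a0 = a1; a1 = a2; a2 = 0;            /* shift the accumulators */
    }
    xy[2*l - 1] = a0;                        /* final col_finish */
}
```

Counting the inner iterations per column in this model (1..L for the low part, L-1..1 for the high part) matches the iteration counts the gen_loop_low and gen_loop_high macros generate.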
These macros are used by the x86_64_comba_unrolled function. The function uses a separate assembly-level define of Karatsuba_Thresh, which is matched against the passed FZ size at function entry [3]. After the check, the function clears the registers and generates the loops that compute the low and the high part of the result. A final round of col_finish is necessary to write the last word of the result from the accumulator to memory.
# Arguments:
# RDI: X
# RSI: Y
# RDX: XY
# RCX: L
.global x86_64_comba_unrolled
x86_64_comba_unrolled:
push rbx
push r12
cmp rcx, Karatsuba_Thresh
jne size_fail
mov rcx, rdx               # RCX := XY
xor r12, r12               # TMP := 0
xor r8, r8                 # A0 := 0
xor r9, r9                 # A1 := 0
xor r10, r10               # A2 := 0
xor rbx, rbx               # N := 0
gen_loop_low Karatsuba_Thresh
gen_loop_high Karatsuba_Thresh
col_finish
pop r12
pop rbx
ret
size_fail:
ud2
Evaluating this code is a bit tricky: for the Chapter 9 code, I used the signature verification benchmark. However, after the introduction of Karatsuba multiplication, the fraction of cycles spent in multiplication dropped to approximately 1%, rendering that tape useless as a benchmark. Therefore, I have prepared a trivial microbenchmark for multiplication. I will make the final conclusion about which unrolled variant is really worth using once I get to Chapter 14.
The evaluation results are presented in the table below:
Variant               | Runtime (s) at bitness
                      |  2048 |  4096 |  8192
----------------------+-------+-------+------
Orig. Ch.11           | 0.621 | 1.919 | 5.894
Asm-Loops (8 thres.)  | 0.211 | 0.702 | 2.261
Asm-8                 | 0.201 | 0.673 | 2.168
Asm-Loops (16 thres.) | 0.151 | 0.526 | 1.737
Asm-16                | 0.145 | 0.506 | 1.657
Asm-Loops (32 thres.) | 0.141 | 0.497 | 1.638
Asm-32                | 0.128 | 0.461 | 1.551
where Asm-Loops means the assembly code with loops from my previous vpatch adapted to Chapter 11 (with different Karatsuba threshold points: 8, 16, 32), and Asm-X means the unrolled Comba assembly for Karatsuba threshold X. One can see that the unrolled Comba provides benefits at all unroll values. It may be beneficial to unroll the code even further; however, this will require additional experiments: with Karatsuba_Thresh set to 32 and an FZ width of 2048 bits, multiplications are computed entirely by the assembly Comba - maybe it would be better to translate the Karatsuba algorithm to assembly instead?
My seals for FFA Chapters 10 and 11:
curl 'http://bvt-trace.net/vpatches/ffa_ch10_karatsuba.kv.vpatch.bvt.sig' > ffa_ch10_karatsuba.kv.vpatch.bvt.sig
curl 'http://bvt-trace.net/vpatches/ffa_ch11_tuning_and_api.kv.vpatch.bvt.sig' > ffa_ch11_tuning_and_api.kv.vpatch.bvt.sig
The vpatch and the seal for the current chapter are available via the links below. The vpatch requires all previous FFA vpatches, up to and including Chapter 11.
curl 'http://bvt-trace.net/vpatches/ch11_unrolled_asm_comba.vpatch' > ch11_unrolled_asm_comba.vpatch
curl 'http://bvt-trace.net/vpatches/ch11_unrolled_asm_comba.vpatch.bvt.sig' > ch11_unrolled_asm_comba.vpatch.bvt.sig
1. Otherwise, I'd have to increase the minimum ffacalc bitness to 512 for Karatsuba_Threshold 8, to 1024 for 16, and to 2048 for 32. TBH, I don't quite like the fastpath variant - having two versions of the same code in a codebase definitely has some smell. I will decide whether it is worth simply increasing the minimal bitness when I get to Chapter 14.
2. In all cases this number is known at compile-time.
3. This check has low cost, but the potential to prevent severe consequences in case of developer neglect.