FFA Chapter 9 Homework: Comba in X86_64 Assembly

March 9th, 2019

In this post I want to share a solution to one of the homework tasks of FFA Chapter 9, dedicated to the Comba multiplication algorithm. More specifically, I will show my implementation of the Comba algorithm in x86_64 assembly. I used the Ada code as a foundation for the implementation; the asm code, as you will see, is a straightforward translation of the Ada code from the Comba article.

As far as integration of the assembler code with Ada is concerned, I used the C calling convention for the assembler function and imported the assembler code as if it were written in C:

   procedure FZ_Mul_Comba(X     : in  FZ;
                          Y     : in  FZ;
                          XY_Lo : out FZ;
                          XY_Hi : out FZ) is

      -- Words in each multiplicand
      L : constant Word_Index := X'Length;

      -- Length of Product, i.e. double the length of either multiplicand
      LP : constant Word_Index := 2 * L;

      -- Register holding Product; indexed from zero
      XY : FZ(0 .. LP - 1);

      procedure Asm_Comba(X      : in  FZ;
                          Y      : in  FZ;
                          XY     : out FZ;
                          X_Size : in  Word_Count);
      pragma Import (C, Asm_Comba, "x86_64_comba");
   begin
      Asm_Comba(X, Y, XY, L);
      XY_Lo := XY(0 .. L - 1);
      XY_Hi := XY(L .. XY'Last);
   end FZ_Mul_Comba;

The function signature is the same; I also allocated contiguous storage for the product, copying the relevant parts back into the two output FZs afterwards. Ada subprograms receive the size of FZ arguments automatically, while in assembler we have to pass it by hand. Thus, we pass an additional argument, X_Size, carrying the length of the single-width FZs.

The assembler code is in the file ffa/libffa/x86_64_comba.s; a small change is necessary to make gprbuild find this file:

   type Mode_Type is ("debug", "release");
   Mode : Mode_Type := external ("mode", "release");

-  for Languages         use ("Ada");
+  for Languages         use ("Ada", "Asm");
   for Source_Dirs       use (".");
   for Library_Dir       use "lib";
   for Library_Name      use "FFA";

Now, the actual assembly code. Just as in Stanislav's article, there are two functions, comba and col, the former using a subset of the registers of the latter. Let's have a look at the register allocation. The entry point is the comba function, which, according to the ABI, receives its arguments in the following registers:

  • RDI: X, an array of X'Length words
  • RSI: Y, an array of X'Length words
  • RDX: XY, an array of 2*X'Length words
  • RCX: X_Size, the base FZ length in words (the same for X and Y).

On the other hand, the one instruction we will definitely use is mul, which multiplies the value in RAX by another register, putting the high part of the result into RDX and the low part into RAX. This suggests that both RAX and RDX should be treated as scratch registers. Thus, I move XY into another register, R13, right at the start of the function.
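
To make the constraint concrete, here is a minimal sketch of a single word-by-word multiplication (illustrative only, not code from the vpatch; the choice of source operands is arbitrary):

mov rax, [rdi]           # rax := a word of X (one mul operand must be in RAX)
mov rdx, [rsi]           # rdx := a word of Y
mul rdx                  # rdx:rax := rax * rdx; both RAX and RDX are clobbered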

Other variables in comba:

  • The accumulators A0-A2 live in R8-R10.
  • The loop counter I lives in RBX.
  • The input arguments of col, U and V, live in R11 and R12.
  • The upper bound of each of the two loops in comba is kept in RCX.

Inside col, we additionally make use of the following registers:

  • RAX and RDX are used both for reading individual words of X and Y into registers and for storing the results of the multiplications.
  • R11, which holds the lower bound of the loop on entry, is reused as the variable J, the loop index over X and Y. As a consequence, comba has to recompute U before every call to col; it cannot be reused even when it stays invariant across a loop.

Registers RBX, RBP, RSP, and R12-R15 are callee-saved in the x86_64 SysV ABI, so we have to save any of them that we intend to change.

With this in mind, first let's inspect the col function:

col:
cmp r11, r12             # exit when J > V
jg col_output            # ...
lea rdx, [rsi + 8*rbx]   # rdx := Y'Address + N*8
lea rax, [8*r11]         # rax := J*8
sub rdx, rax             # rdx := Y'Address + (N - J)*8
mov rdx, [rdx]           # rdx := Y(N - J)
mov rax, [rdi + 8*r11]   # rax := X(J) = *(X'Address + J*8)
mul rdx                  # rdx:rax := rax*rdx
add r8,  rax             # A0, C := A0 + rax
adc r9,  rdx             # A1, C := A1 + rdx + C
adc r10, 0               # A2, [C=0] := A2 + 0 + C
inc r11                  # J := J + 1
jmp col
col_output:
mov [r13 + 8*rbx], r8    # XY(N) := A0
mov r8, r9               # A0 := A1
mov r9, r10              # A1 := A2
xor r10, r10             # A2 := 0
ret

Because the scale parameter in memory addressing can't be negative, we cannot load the word from Y with a single LEA and MOV; instead, we have to do a bit more arithmetic. The rest of the code is IMO straightforward.
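
Just as an illustration, an equivalent way to perform the same load within that constraint (a sketch, not what the vpatch does) is to compute the index N - J into a scratch register first and only then do a single indexed MOV:

mov rax, rbx             # rax := N
sub rax, r11             # rax := N - J
mov rdx, [rsi + 8*rax]   # rdx := Y(N - J) = *(Y'Address + (N - J)*8)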

Now, the comba function:

.global x86_64_comba
x86_64_comba:
push rbx
push r12
push r13
mov r13, rdx   # RDX is used by MUL, move XY to a free register

xor r8,  r8    # A0 := 0
xor r9,  r9    # A1 := 0
xor r10, r10   # A2 := 0
xor rbx, rbx   # I  := 0

loop_1:
cmp rbx, rcx   # exit when I >= L
jge end_loop_1 # ...
xor r11, r11   # U := 0
mov r12, rbx   # V := I
call col       #
inc rbx        # I := I + 1
jmp loop_1
end_loop_1:

# rbx = L after the previous loop
lea r12, [rcx - 1]   # V := L - 1
mov rcx, r12         # RCX := L - 1
shl rcx, 1           # RCX := (L - 1)*2
loop_2:
cmp rbx, rcx         # exit when I > 2*L-2
jg end_loop_2        # ...
mov r11, rbx         # U := I
sub r11, r12         # U := I - V = I - L + 1
call col             # V already set to L - 1
inc rbx              # I := I + 1
jmp loop_2
end_loop_2:

mov [r13 + 8*rbx], r8 # XY(I) := A0, with I = 2*L - 1 (the topmost word)

end_comba:
pop r13
pop r12
pop rbx
ret

The comments should explain the code well enough.
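
As a quick sanity check of the loop bounds, consider a hypothetical L = 2 (a trace worked out from the code above, not taken from the article): loop_1 calls col(U=0, V=0) for column N=0 (X(0)*Y(0), written to XY(0)) and col(U=0, V=1) for column N=1 (X(0)*Y(1) + X(1)*Y(0), written to XY(1)); loop_2 calls col(U=1, V=1) for column N=2 (X(1)*Y(1), written to XY(2)); and the final store after the loop puts the remaining carry word into XY(3).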

The functions presented are constant-time, provided that the underlying individual instructions are constant-time: they contain no branching on any secret values (the words loaded from X and Y and those written to XY).
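
For contrast, a data-dependent branch would look something like the following (an illustrative non-example with a made-up label, not present anywhere in the code above):

cmp rax, 0               # rax holding a word loaded from X
je  skip_column          # taken or not depending on the secret value, so timing would leak it

Nothing of that shape occurs in col or comba: the loop bounds and branch conditions depend only on the public length L.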

As far as correctness is concerned, the result should be correct as a straightforward translation of the Ada code. Potential pitfalls in assembler code are integer overflows and underflows, but looking at the code I could not find any place where one would be possible. Additionally, it can be tested[1] in the following way:

$ cat rndmul.tape
[.]?"#
[.]?"#
[*]*
[.]#[={}{[SAD-1]}_
]
[.]#[={}{[SAD-2]}_
]
$ while true; do ~/ffa_calc_9 2048 32 /dev/urandom < rndmul.tape | ~/ffa_calc_9_asm 2048 32; done

The performance improvement from implementing Comba in assembly instead of Ada, using an RSA signature verification program as a benchmark, is 7% at FZ widths 2048 and 4096.

I suspect that the register allocation scheme looks ugly to those who programmed x86 assembly in the 80s and 90s, and for whom "RAX" and "RSI" mean something. As far as I'm concerned, they could just as well have been called R1-R8. But if you have a suggestion for how to improve the register allocation scheme, I'm all ears.

Vpatches:

curl 'http://bvt-trace.net/vpatches/ch9_asm_comba.vpatch' > ch9_asm_comba.vpatch
curl 'http://bvt-trace.net/vpatches/ch9_asm_comba.vpatch.bvt.sig' > ch9_asm_comba.vpatch.bvt.sig

And seals for chapters 7-9:

curl 'http://bvt-trace.net/vpatches/ffa_ch7_turbo_egyptians.kv.vpatch.bvt.sig' > ffa_ch7_turbo_egyptians.kv.vpatch.bvt.sig
curl 'http://bvt-trace.net/vpatches/ffa_ch8_randomism.kv.vpatch.bvt.sig' > ffa_ch8_randomism.kv.vpatch.bvt.sig
curl 'http://bvt-trace.net/vpatches/ffa_ch9_exodus.kv.vpatch.bvt.sig' > ffa_ch9_exodus.kv.vpatch.bvt.sig
  1. Testing did not show the presence of any bug, but it also did not prove their absence.

5 Responses to “FFA Chapter 9 Homework: Comba in X86_64 Assembly”

  1. Pretty neat!

    The secret behind the "only 7%" is that the CPU cost of the Modular Exponentiation in Ch. 6 - 13 consists almost entirely of the modular reduction's cost (via Knuth division) and multiplication is not involved there.

    In 14B, where Barrett's Method is introduced, this changes, and the cost of multiplication then forms the bulk of the cost of Modular Exponentiation. To properly test the ASMism with current FFA, one would have to also implement FZ_Sqr_Comba ASMistically; that is where half of the multiplications happen.

    It is also IMHO worth a try to see whether the SSE 256-bit MUL performs in constant time (and on what -- it is entirely possible that, e.g., AMD does, while Intel does not.)

  2. Also ought to add, you will probably want to play with the Karatsuba threshold (in both regular and "square case" Karatsuba) -- I suspect it will have to change in response to an ASMistic MUL.

  3. Further, re: SSE: observe that the minimal permitted FZ bitness is 256. If the Karatsuba threshold is set such that a 256-bit FZ goes straight to the base case, then one does not need Comba at all, and it can be replaced with a single 256-bit MUL instruction.

  4. bvt says:

    Thanks!

    Modifying this code for LoMul and squaring should be rather easy. I guess I'll update this vpatch when I get to Ch.14.

    Re SSE, IIRC there is (was?) no instruction to multiply bignums. SSE typically does various operations on vectors of numbers (dot products/per-element multiplications) -- and at least a cursory look at https://software.intel.com/sites/landingpage/IntrinsicsGuide/ supports this. It is a separate question whether any of the available SSE instructions can be used to speed up FFA (if constant-time).

  5. Digging in the docs seems to confirm! -- I have no idea where I got the notion that a 256 (or even 128) bit multiplier exists on x86 iron somewhere.

    The observation re Comba stands though -- given how it is used as the Karatsuba base case, it probably ought to be replaced with a set of fully-unrolled cases.
