Comments on: FFA Chapter 9 Homework: Comba in X86_64 Assembly

By: Stanislav Datskovskiy

Stanislav Datskovskiy — Sat, 09 Mar 2019 23:48:57 +0000

Digging in the docs seems to confirm! -- I have no idea where I got the notion that a 256 (or even 128) bit multiplier exists on x86 iron somewhere.

The observation re Comba stands though -- given how it is used as the Karatsuba base case, it probably ought to be replaced with a set of fully-unrolled cases.

By: bvt

bvt — Sat, 09 Mar 2019 22:54:49 +0000

Thanks!

Modifying this code for LoMul and squaring should be rather easy. I guess I'll update this vpatch when I get to Ch.14.

Re SSE, IIRC there is (was?) no instruction to multiply bignums. SSE typically does various operations on number vectors (dot product/per-element multiplications) -- and at least cursory look at https://software.intel.com/sites/landingpage/IntrinsicsGuide/ supports this. There is another question whether any of the available SSE instructions can be used to speed up FFA (if constant-time).

By: Stanislav Datskovskiy

Stanislav Datskovskiy — Sat, 09 Mar 2019 22:18:23 +0000

Further, re: SSE: observe that the minimal permitted FZ bitness is 256. If the Karatsuba threshhold is set such that 256-bit FZ goes straight to the base case, then one does not need Comba at all, and it can be replaced with a single 256-bit MUL instruction.

By: Stanislav Datskovskiy

Stanislav Datskovskiy — Sat, 09 Mar 2019 22:12:04 +0000

Also ought to add, you will probably want to play with the Karatsuba threshold (in both regular and "square case" Karatsuba) -- I suspect it will have to change in response to an ASMistic MUL.

By: Stanislav Datskovskiy

Stanislav Datskovskiy — Sat, 09 Mar 2019 22:10:29 +0000

Pretty neat!

The secret behind the "only 7%" is that the CPU cost of the Modular Exponentiation in Ch. 6 - 13 consists almost entirely of the modular reduction's cost (via Knuth division) and multiplication is not involved there.

In 14B, where Barrett's Method is introduced, this changes, and the cost of multiplication then forms the bulk of the cost of Modular Exponentiation. To properly test the ASMism with current FFA, would have to also implement FZ_Sqr_Comba ASMistically, it is where half of the multiplications happen.

It is also IMHO worth a try to see whether the SSE 256-bit MUL performs in constant time (and on what -- it is entirely possible that, e.g., AMD does, while Intel does not.)