Will the GNAT Ada compiler inline a function when using pragma Inline_Always, and does the dump of a function prove to be a red herring? In short: yes, it will, and no, it does not. To take effect on calls from other packages, the pragma Inline_Always needs to be put in the specification files ('*.ads') and not in the implementation files ('*.adb'). The dump of a function in a binary can accurately show whether inlining was performed.
To run the code, use the following v-patch on the code in FFA chapter 1:
- The patch: ffa_ch1_inline_demonstration.vpatch.
- A signature: ffa_ch1_inline_demonstration.vpatch.ave1.sig.
We start with the preliminaries. The GNAT version, as reported by gprbuild --version:
GPRBUILD GPL 2016 (20160515) (x86_64-pc-linux-gnu)
Copyright (C) 2004-2016, AdaCore
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
In FFA almost all functions carry the pragma Inline_Always adornment. According to the gcc 4.9.4 documentation, this pragma is "Similar to pragma Inline except that inlining is not subject to the use of option -gnatn or -gnatN and the inlining happens regardless of whether these options are used."
All of the code is based on a copy of the ffademo files provided in the FFA chapter 1 genesis.
I base the tests on the FZ_Add procedure and use it to implement the slowest possible multiplication function. This multiplication function is specified to multiply an FZ by a natural number. To multiply, we add the input FZ to a tally as many times as given by the natural number:
procedure Mul_By_Sum (N        : in  Natural;
                      X        : in  FZ;
                      Z        : out FZ;
                      Overflow : out WBool) is
   C : WBool := 0;
begin
   Z := (others => 0);
   for i in 1 .. N loop
      FZ_Add (X, Z, Z, C);
   end loop;
   Overflow := C;
end Mul_By_Sum;

procedure Inline_Experiment is
   X : FZ (1 .. 256) := (others => 0);
   Z : FZ (1 .. 256) := (others => 0);
   -- Carry.
   C : WBool := 0;
   N : Indices := 256;
begin
   for i in 1 .. N loop
      X (i) := Word (i);
   end loop;
   Mul_By_Sum (10000000, X, Z, C);
   Dump (Z);
end Inline_Experiment;
The experiment is divided over three programs, each with its own set of files:
- inline_exp1_main: the FZ_Add and W_Carry procedures from libffa are used as is.
- inline_exp2_main: the FZ_Add and W_Carry procedures are copied and included in the body of the module. For clarity we add a '2' to the procedure names.
- inline_exp3_main: the FZ_Add and W_Carry procedures are copied to a separate module and pragma Inline_Always is included in the specification file. For clarity we add a '3' to the procedure names.
Read the code in ffainline (except for Mul_By_Sum, it consists entirely of calls to ffalib or copies of code from ffalib), then build and run everything with:
cd ffa/ffainline
gprbuild
time ./bin/inline_exp1_main > exp1.txt
time ./bin/inline_exp2_main > exp2.txt
time ./bin/inline_exp3_main > exp3.txt
diff exp0.txt exp1.txt
diff exp0.txt exp2.txt
diff exp0.txt exp3.txt
The diffs should be empty, and the timings should differ. A table with my timings:
Experiment | Time (seconds) |
---|---|
1 | 8.579 |
2 | 4.924 |
3 | 4.871 |
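To put the timings in the table in relative terms, here is a small sketch that computes the share of run time saved against experiment 1. The function name time_saved_pct is made up for this illustration; the numbers are the ones from the table.

```shell
#!/bin/sh
# time_saved_pct BASE NEW (hypothetical helper name):
# prints the percentage of run time saved by NEW relative to BASE.
time_saved_pct() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (base - new) / base * 100 }'
}

time_saved_pct 8.579 4.924   # experiment 2 versus experiment 1, prints 42.6
time_saved_pct 8.579 4.871   # experiment 3 versus experiment 1, prints 43.2
```

Note that this measures time saved, not speed gained: taking about 43% less time corresponds to running roughly 1.7 times as fast.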
The second and third versions take a little over 40% less time than the first version. The only difference between the programs is where the procedures are defined, so the timing difference is all due to inlining. Let's check by disassembling the code with objdump. As objdump disassembles everything, we need a small script that returns just the code for one function:
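#!/bin/sh
# Usage: dump-function.sh <binary> <symbol>
# Disassembles one function: from the address of <symbol> up to the
# address of the next symbol in the address-sorted symbol table.

# Sort the symbol table by address, number the lines, find the symbol.
FUNCTIONLINE=`objdump -t $1 | sort | nl | grep $2`
LINENR=`echo $FUNCTIONLINE | awk '{print $1}'`
NEXTLINENR=`expr $LINENR + 1`
# The ' @ ' separator makes the line number easy to match with grep.
NEXTLINE=`objdump -t $1 | sort | nl -s ' @ ' | grep "$NEXTLINENR @ "`
START=`echo $FUNCTIONLINE | awk '{print $2}'`
STOP=`echo $NEXTLINE | awk '{print $3}'`
objdump --start-address=0x$START --stop-address=0x$STOP -S -d $1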
The dump for the first version of Mul_By_Sum:
./dump-function.sh inline_exp1_main inline_exp1__mul_by_sum
inline_exp1_main: file format elf64-x86-64

Disassembly of section .text:

0000000000401430 <inline_exp1__mul_by_sum>:
  401430:   55   push %rbp
  401431:   48 89 e5   mov %rsp,%rbp
  401434:   41 57   push %r15
  401436:   41 56   push %r14
  401438:   41 55   push %r13
  40143a:   41 54   push %r12
  40143c:   53   push %rbx
  40143d:   48 81 ec 38 10 00 00   sub $0x1038,%rsp
  401444:   48 83 0c 24 00   orq $0x0,(%rsp)
  401449:   48 81 c4 20 10 00 00   add $0x1020,%rsp
  401450:   49 63 40 04   movslq 0x4(%r8),%rax
  401454:   49 89 cc   mov %rcx,%r12
  401457:   49 63 08   movslq (%r8),%rcx
  40145a:   49 89 d7   mov %rdx,%r15
  40145d:   31 d2   xor %edx,%edx
  40145f:   41 89 fd   mov %edi,%r13d
  401462:   48 89 75 c8   mov %rsi,-0x38(%rbp)
  401466:   4c 89 c3   mov %r8,%rbx
  401469:   39 c1   cmp %eax,%ecx
  40146b:   7f 1c   jg 401489 <inline_exp1__mul_by_sum+0x59>
  40146d:   48 29 c8   sub %rcx,%rax
  401470:   48 8d 14 c5 08 00 00   lea 0x8(,%rax,8),%rdx
  401477:   00
  401478:   48 b8 00 00 00 00 04   movabs $0x400000000,%rax
  40147f:   00 00 00
  401482:   48 39 c2   cmp %rax,%rdx
  401485:   48 0f 47 d0   cmova %rax,%rdx
  401489:   31 f6   xor %esi,%esi
  40148b:   4c 89 e7   mov %r12,%rdi
  40148e:   e8 bd 80 03 00   callq 439550 <memset>
  401493:   45 85 ed   test %r13d,%r13d
  401496:   7e 38   jle 4014d0 <inline_exp1__mul_by_sum+0xa0>
  401498:   45 31 f6   xor %r14d,%r14d
  40149b:   0f 1f 44 00 00   nopl 0x0(%rax,%rax,1)
  4014a0:   48 8b 7d c8   mov -0x38(%rbp),%rdi
  4014a4:   41 83 c6 01   add $0x1,%r14d
  4014a8:   4d 89 e0   mov %r12,%r8
  4014ab:   49 89 d9   mov %rbx,%r9
  4014ae:   4c 89 e2   mov %r12,%rdx
  4014b1:   48 89 d9   mov %rbx,%rcx
  4014b4:   4c 89 fe   mov %r15,%rsi
  4014b7:   e8 c4 00 00 00   callq 401580 <fz_arith__fz_add>
  4014bc:   45 39 f5   cmp %r14d,%r13d
  4014bf:   75 df   jne 4014a0 <inline_exp1__mul_by_sum+0x70>
  4014c1:   48 83 c4 18   add $0x18,%rsp
  4014c5:   5b   pop %rbx
  4014c6:   41 5c   pop %r12
  4014c8:   41 5d   pop %r13
  4014ca:   41 5e   pop %r14
  4014cc:   41 5f   pop %r15
  4014ce:   5d   pop %rbp
  4014cf:   c3   retq
  4014d0:   31 c0   xor %eax,%eax
  4014d2:   48 83 c4 18   add $0x18,%rsp
  4014d6:   5b   pop %rbx
  4014d7:   41 5c   pop %r12
  4014d9:   41 5d   pop %r13
  4014db:   41 5e   pop %r14
  4014dd:   41 5f   pop %r15
  4014df:   5d   pop %rbp
  4014e0:   c3   retq
  4014e1:   66 2e 0f 1f 84 00 00   nopw %cs:0x0(%rax,%rax,1)
  4014e8:   00 00 00
  4014eb:   0f 1f 44 00 00   nopl 0x0(%rax,%rax,1)
Can you see the callq 401580 <fz_arith__fz_add> in there? Clearly, FZ_Add was not inlined. Now for the second version:
./dump-function.sh inline_exp2_main inline_exp2__mul_by_sum
inline_exp2_main: file format elf64-x86-64

Disassembly of section .text:

0000000000401420 <inline_exp2__mul_by_sum>:
  401420:   55   push %rbp
  401421:   48 89 e5   mov %rsp,%rbp
  401424:   41 57   push %r15
  401426:   41 56   push %r14
  401428:   41 55   push %r13
  40142a:   41 54   push %r12
  40142c:   53   push %rbx
  40142d:   48 81 ec 38 10 00 00   sub $0x1038,%rsp
  401434:   48 83 0c 24 00   orq $0x0,(%rsp)
  401439:   48 81 c4 20 10 00 00   add $0x1020,%rsp
  401440:   49 63 40 04   movslq 0x4(%r8),%rax
  401444:   49 89 cf   mov %rcx,%r15
  401447:   49 63 08   movslq (%r8),%rcx
  40144a:   49 89 d6   mov %rdx,%r14
  40144d:   31 d2   xor %edx,%edx
  40144f:   89 fb   mov %edi,%ebx
  401451:   49 89 f4   mov %rsi,%r12
  401454:   4d 89 c5   mov %r8,%r13
  401457:   39 c1   cmp %eax,%ecx
  401459:   7f 1c   jg 401477 <inline_exp2__mul_by_sum+0x57>
  40145b:   48 29 c8   sub %rcx,%rax
  40145e:   48 8d 14 c5 08 00 00   lea 0x8(,%rax,8),%rdx
  401465:   00
  401466:   48 b8 00 00 00 00 04   movabs $0x400000000,%rax
  40146d:   00 00 00
  401470:   48 39 c2   cmp %rax,%rdx
  401473:   48 0f 47 d0   cmova %rax,%rdx
  401477:   31 f6   xor %esi,%esi
  401479:   4c 89 ff   mov %r15,%rdi
  40147c:   e8 af 7e 03 00   callq 439330 <memset>
  401481:   85 db   test %ebx,%ebx
  401483:   0f 8e db 00 00 00   jle 401564 <inline_exp2__mul_by_sum+0x144>
  401489:   49 63 06   movslq (%r14),%rax
  40148c:   45 8b 55 00   mov 0x0(%r13),%r10d
  401490:   45 8b 5d 04   mov 0x4(%r13),%r11d
  401494:   45 8b 76 04   mov 0x4(%r14),%r14d
  401498:   49 89 c5   mov %rax,%r13
  40149b:   48 89 45 c0   mov %rax,-0x40(%rbp)
  40149f:   49 63 c2   movslq %r10d,%rax
  4014a2:   4c 89 ef   mov %r13,%rdi
  4014a5:   48 29 c7   sub %rax,%rdi
  4014a8:   49 8d 04 ff   lea (%r15,%rdi,8),%rax
  4014ac:   45 31 ff   xor %r15d,%r15d
  4014af:   48 89 45 c8   mov %rax,-0x38(%rbp)
  4014b3:   4c 89 e8   mov %r13,%rax
  4014b6:   48 f7 d8   neg %rax
  4014b9:   49 8d 3c c4   lea (%r12,%rax,8),%rdi
  4014bd:   4d 63 e6   movslq %r14d,%r12
  4014c0:   41 83 c7 01   add $0x1,%r15d
  4014c4:   45 39 ee   cmp %r13d,%r14d
  4014c7:   0f 8c 93 00 00 00   jl 401560 <inline_exp2__mul_by_sum+0x140>
  4014cd:   48 8b 45 c0   mov -0x40(%rbp),%rax
  4014d1:   4c 8b 4d c8   mov -0x38(%rbp),%r9
  4014d5:   31 f6   xor %esi,%esi
  4014d7:   4c 8d 40 ff   lea -0x1(%rax),%r8
  4014db:   48 89 f0   mov %rsi,%rax
  4014de:   66 90   xchg %ax,%ax
  4014e0:   49 83 c0 01   add $0x1,%r8
  4014e4:   45 39 c3   cmp %r8d,%r11d
  4014e7:   4a 8b 0c c7   mov (%rdi,%r8,8),%rcx
  4014eb:   7c 53   jl 401540 <inline_exp2__mul_by_sum+0x120>
  4014ed:   45 39 c2   cmp %r8d,%r10d
  4014f0:   7f 4e   jg 401540 <inline_exp2__mul_by_sum+0x120>
  4014f2:   49 8b 11   mov (%r9),%rdx
  4014f5:   48 01 c8   add %rcx,%rax
  4014f8:   49 83 c1 08   add $0x8,%r9
  4014fc:   48 01 d0   add %rdx,%rax
  4014ff:   48 89 d6   mov %rdx,%rsi
  401502:   48 21 ca   and %rcx,%rdx
  401505:   49 89 41 f8   mov %rax,-0x8(%r9)
  401509:   48 09 ce   or %rcx,%rsi
  40150c:   48 f7 d0   not %rax
  40150f:   48 21 f0   and %rsi,%rax
  401512:   48 09 d0   or %rdx,%rax
  401515:   48 c1 e8 3f   shr $0x3f,%rax
  401519:   4d 39 e0   cmp %r12,%r8
  40151c:   75 c2   jne 4014e0 <inline_exp2__mul_by_sum+0xc0>
  40151e:   48 89 c6   mov %rax,%rsi
  401521:   44 39 fb   cmp %r15d,%ebx
  401524:   75 9a   jne 4014c0 <inline_exp2__mul_by_sum+0xa0>
  401526:   48 89 f0   mov %rsi,%rax
  401529:   48 83 c4 18   add $0x18,%rsp
  40152d:   5b   pop %rbx
  40152e:   41 5c   pop %r12
  401530:   41 5d   pop %r13
  401532:   41 5e   pop %r14
  401534:   41 5f   pop %r15
  401536:   5d   pop %rbp
  401537:   c3   retq
  401538:   0f 1f 84 00 00 00 00   nopl 0x0(%rax,%rax,1)
  40153f:   00
  401540:   45 39 d5   cmp %r10d,%r13d
  401543:   7c 05   jl 40154a <inline_exp2__mul_by_sum+0x12a>
  401545:   45 39 de   cmp %r11d,%r14d
  401548:   7e a8   jle 4014f2 <inline_exp2__mul_by_sum+0xd2>
  40154a:   be 2f 00 00 00   mov $0x2f,%esi
  40154f:   bf 80 c4 49 00   mov $0x49c480,%edi
  401554:   31 c0   xor %eax,%eax
  401556:   e8 e1 0e 00 00   callq 40243c <__gnat_rcheck_CE_Index_Check>
  40155b:   0f 1f 44 00 00   nopl 0x0(%rax,%rax,1)
  401560:   31 f6   xor %esi,%esi
  401562:   eb bd   jmp 401521 <inline_exp2__mul_by_sum+0x101>
  401564:   31 c0   xor %eax,%eax
  401566:   48 83 c4 18   add $0x18,%rsp
  40156a:   5b   pop %rbx
  40156b:   41 5c   pop %r12
  40156d:   41 5d   pop %r13
  40156f:   41 5e   pop %r14
  401571:   41 5f   pop %r15
  401573:   5d   pop %rbp
  401574:   c3   retq
  401575:   66 2e 0f 1f 84 00 00   nopw %cs:0x0(%rax,%rax,1)
  40157c:   00 00 00
  40157f:   90   nop
In this version the call is gone, and the instructions that can be expected from FZ_Add and W_Carry are included. So inlining is happening and can be checked with objdump.
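Reading a long dump by eye is error-prone, so the check for a remaining call can be automated. The following is a minimal sketch, not part of the patch: the helper name calls_symbol is hypothetical, and it assumes the symbol name contains no regular-expression metacharacters (true for names such as fz_arith__fz_add) and that call targets are printed in the `callq ADDRESS <symbol>` form seen in the dumps above.

```shell
#!/bin/sh
# calls_symbol SYMBOL (hypothetical helper name):
# reads disassembly text on stdin and exits 0 if some call
# instruction targets <SYMBOL>, non-zero otherwise.
calls_symbol() {
    grep -Eq "call[a-z]*[[:space:]]+[0-9a-f]+ <$1>"
}

# Intended use, with the script and binaries from this article:
#   ./dump-function.sh inline_exp1_main inline_exp1__mul_by_sum \
#       | calls_symbol fz_arith__fz_add && echo "FZ_Add is still called"
```

For the first experiment this should report a remaining call, and for the second and third it should stay silent, provided the symbol name passed in is the one the call would have had.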
In the logs an assertion was made that gnat/gcc would include an inlined function for reference in the produced binary. This is not so: setting the "-ffunction-sections" flag for the compiler and combining it with "--gc-sections" for the linker will remove all sections for symbols that are not referenced.
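Whether an out-of-line copy of a function survives in a given binary can be checked against the symbol table. This is a rough sketch under stated assumptions: the helper name symbol_present is made up, the binary must not be stripped, and the symbol name is taken to be the last field of each `objdump -t` line.

```shell
#!/bin/sh
# symbol_present SYMBOL (hypothetical helper name):
# reads `objdump -t` output on stdin and exits 0 if some line
# has SYMBOL as its last field, non-zero otherwise.
symbol_present() {
    awk -v sym="$1" '$NF == sym { found = 1 } END { exit !found }'
}

# Intended use:
#   objdump -t inline_exp1_main | symbol_present fz_arith__fz_add \
#       && echo "an out-of-line copy of FZ_Add is in the binary"
```

In the first experiment the symbol has to be present, because the loop calls it. For the inlined variants the outcome depends on the compiler and linker flags described above.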
How then to explain the massive timing variance when my "incorrect" inline pragmas are selectively removed?
FTR I see exactly the same effect on the dump when toggling the pragmas in their current (i.e. in the .adb bodies) locations. (This is on gnat 2016).
Which calls are you looking at? For methods in the same module the inlining pragma works. For methods outside of the module it does not seem to work.
[...] ave1 carefully analyzed the executables generated by AdaCore’s x86-64 GNAT, and showed that this compiler is prone to ignore Inline pragmas unless they are specified in a particular [...]