Will the GNAT Ada compiler inline a function marked with pragma Inline_Always, and is the dump of a function a red herring? In short: yes, it will, and no, it is not. The pragma Inline_Always must be placed in the specification files ('*.ads'), not in the implementation files ('*.adb'). The dump of a function in a binary accurately shows whether inlining was performed.
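To see at a glance where the pragma currently lives in a source tree, a small helper along the following lines may be useful. It is only a sketch: the function name list_inline_always is made up for this illustration, and it does a plain text search, so it will also report pragmas that sit inside comments.

```shell
#!/bin/sh
# list_inline_always DIR (hypothetical helper, not part of FFA):
# print every .ads/.adb file under DIR that mentions pragma Inline_Always.
# Ada is case-insensitive, hence the -i flag.
list_inline_always() {
    find "$1" -type f \( -name '*.ads' -o -name '*.adb' \) \
        -exec grep -liE 'pragma[[:space:]]+Inline_Always' {} + | sort
}
```

Any '*.adb' file in the output is a candidate for moving the pragma to the matching '*.ads' file.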
To run the code, use the following v-patch on the code in FFA chapter 1;
We start with the preliminaries. The GNAT version used, as reported by gprbuild -version:
GPRBUILD GPL 2016 (20160515) (x86_64-pc-linux-gnu)
Copyright (C) 2004-2016, AdaCore
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
In FFA almost all functions carry the pragma Inline_Always adornment. According to the gcc 4.9.4 documentation, this pragma is "Similar to pragma Inline except that inlining is not subject to the use of option -gnatn or -gnatN and the inlining happens regardless of whether these options are used."
All of the code is based on a copy of the ffademo files provided in the FFA chapter 1 genesis.
I base the tests on the FZ_Add procedure and use it to implement the slowest possible multiplication function. This multiplication function multiplies an FZ by a natural number: it adds the input FZ to a tally as many times as the natural number says.
procedure Mul_By_Sum
  (N        : in  Natural;
   X        : in  FZ;
   Z        : out FZ;
   Overflow : out WBool)
is
   C : WBool := 0;
begin
   Z := (others => 0);
   for i in 1 .. N loop
      FZ_Add (X, Z, Z, C);
   end loop;
   Overflow := C;
end Mul_By_Sum;

procedure Inline_Experiment is
   X : FZ (1 .. 256) := (others => 0);
   Z : FZ (1 .. 256) := (others => 0);
   -- Carry.
   C : WBool := 0;
   N : Indices := 256;
begin
   for i in 1 .. N loop
      X (i) := Word (i);
   end loop;
   Mul_By_Sum (10000000, X, Z, C);
   Dump (Z);
end Inline_Experiment;
The experiment is divided over 3 programs, each with its own set of files:
- inline_exp1_main: the FZ_Add and W_Carry procedures from libffa are used as is.
- inline_exp2_main: the FZ_Add and W_Carry procedures are copied and included in the body of the module. For clarity we add a '2' to the procedure names.
- inline_exp3_main: the FZ_Add and W_Carry procedures are copied to a separate module and pragma Inline_Always is included in the specification file. For clarity we add a '3' to the procedure names.
Read the code in ffainline (except for mul_by_sum, it consists entirely of calls to ffalib or copies of code from ffalib), then build and run everything with:
cd ffa/ffainline
gprbuild
time ./bin/inline_exp1_main > exp1.txt
time ./bin/inline_exp2_main > exp2.txt
time ./bin/inline_exp3_main > exp3.txt
diff exp0.txt exp1.txt
diff exp0.txt exp2.txt
diff exp0.txt exp3.txt
The diffs should be empty, and the timings should differ. A table with my timings:
Experiment | Time (seconds)
1          | 8.579
2          | 4.924
3          | 4.871
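The relative saving implied by these timings can be worked out with a few lines of shell. This is just a sketch of the arithmetic; the helper name reduction is invented here and the numbers are the ones from the table.

```shell
#!/bin/sh
# reduction BASE T: percentage by which run time T is lower than BASE.
reduction() {
    awk -v base="$1" -v t="$2" \
        'BEGIN { printf "%.1f\n", (base - t) / base * 100 }'
}

reduction 8.579 4.924   # experiment 2 versus experiment 1: prints 42.6
reduction 8.579 4.871   # experiment 3 versus experiment 1: prints 43.2
```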
The second and third versions take a little over 40% less time than the first version. The only difference between the programs is where the procedures are defined, so the timing difference is attributed entirely to inlining. Let's check by disassembling the binaries with objdump. As objdump disassembles everything, we need a small script that returns just the code for one function:
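#!/bin/sh
# Usage: dump-function.sh BINARY SYMBOL
# Disassemble only SYMBOL: from its address up to the next symbol's address.
# Number the address-sorted symbol table and find the line for SYMBOL.
FUNCTIONLINE=`objdump -t "$1" | sort | nl | grep "$2"`
LINENR=`echo $FUNCTIONLINE | awk '{print $1}'`
NEXTLINENR=`expr $LINENR + 1`
# The ' @ ' separator ties the grep pattern to the line number field.
NEXTLINE=`objdump -t "$1" | sort | nl -s ' @ ' | grep "$NEXTLINENR @ "`
START=`echo $FUNCTIONLINE | awk '{print $2}'`
STOP=`echo $NEXTLINE | awk '{print $3}'`
objdump --start-address=0x$START --stop-address=0x$STOP -S -d "$1"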
The dump for the first version of Mul_By_Sum:
./dump-function.sh inline_exp1_main inline_exp1__mul_by_sum
inline_exp1_main: file format elf64-x86-64
Disassembly of section .text:
0000000000401430 <inline_exp1__mul_by_sum>:
401430: 55 push %rbp
401431: 48 89 e5 mov %rsp,%rbp
401434: 41 57 push %r15
401436: 41 56 push %r14
401438: 41 55 push %r13
40143a: 41 54 push %r12
40143c: 53 push %rbx
40143d: 48 81 ec 38 10 00 00 sub $0x1038,%rsp
401444: 48 83 0c 24 00 orq $0x0,(%rsp)
401449: 48 81 c4 20 10 00 00 add $0x1020,%rsp
401450: 49 63 40 04 movslq 0x4(%r8),%rax
401454: 49 89 cc mov %rcx,%r12
401457: 49 63 08 movslq (%r8),%rcx
40145a: 49 89 d7 mov %rdx,%r15
40145d: 31 d2 xor %edx,%edx
40145f: 41 89 fd mov %edi,%r13d
401462: 48 89 75 c8 mov %rsi,-0x38(%rbp)
401466: 4c 89 c3 mov %r8,%rbx
401469: 39 c1 cmp %eax,%ecx
40146b: 7f 1c jg 401489 <inline_exp1__mul_by_sum+0x59>
40146d: 48 29 c8 sub %rcx,%rax
401470: 48 8d 14 c5 08 00 00 lea 0x8(,%rax,8),%rdx
401477: 00
401478: 48 b8 00 00 00 00 04 movabs $0x400000000,%rax
40147f: 00 00 00
401482: 48 39 c2 cmp %rax,%rdx
401485: 48 0f 47 d0 cmova %rax,%rdx
401489: 31 f6 xor %esi,%esi
40148b: 4c 89 e7 mov %r12,%rdi
40148e: e8 bd 80 03 00 callq 439550 <memset>
401493: 45 85 ed test %r13d,%r13d
401496: 7e 38 jle 4014d0 <inline_exp1__mul_by_sum+0xa0>
401498: 45 31 f6 xor %r14d,%r14d
40149b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
4014a0: 48 8b 7d c8 mov -0x38(%rbp),%rdi
4014a4: 41 83 c6 01 add $0x1,%r14d
4014a8: 4d 89 e0 mov %r12,%r8
4014ab: 49 89 d9 mov %rbx,%r9
4014ae: 4c 89 e2 mov %r12,%rdx
4014b1: 48 89 d9 mov %rbx,%rcx
4014b4: 4c 89 fe mov %r15,%rsi
4014b7: e8 c4 00 00 00 callq 401580 <fz_arith__fz_add>
4014bc: 45 39 f5 cmp %r14d,%r13d
4014bf: 75 df jne 4014a0 <inline_exp1__mul_by_sum+0x70>
4014c1: 48 83 c4 18 add $0x18,%rsp
4014c5: 5b pop %rbx
4014c6: 41 5c pop %r12
4014c8: 41 5d pop %r13
4014ca: 41 5e pop %r14
4014cc: 41 5f pop %r15
4014ce: 5d pop %rbp
4014cf: c3 retq
4014d0: 31 c0 xor %eax,%eax
4014d2: 48 83 c4 18 add $0x18,%rsp
4014d6: 5b pop %rbx
4014d7: 41 5c pop %r12
4014d9: 41 5d pop %r13
4014db: 41 5e pop %r14
4014dd: 41 5f pop %r15
4014df: 5d pop %rbp
4014e0: c3 retq
4014e1: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4014e8: 00 00 00
4014eb: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
Can you see the callq 401580 <fz_arith__fz_add> in there? Clearly FZ_Add was not inlined. Now for the second version:
./dump-function.sh inline_exp2_main inline_exp2__mul_by_sum
inline_exp2_main: file format elf64-x86-64
Disassembly of section .text:
0000000000401420 <inline_exp2__mul_by_sum>:
401420: 55 push %rbp
401421: 48 89 e5 mov %rsp,%rbp
401424: 41 57 push %r15
401426: 41 56 push %r14
401428: 41 55 push %r13
40142a: 41 54 push %r12
40142c: 53 push %rbx
40142d: 48 81 ec 38 10 00 00 sub $0x1038,%rsp
401434: 48 83 0c 24 00 orq $0x0,(%rsp)
401439: 48 81 c4 20 10 00 00 add $0x1020,%rsp
401440: 49 63 40 04 movslq 0x4(%r8),%rax
401444: 49 89 cf mov %rcx,%r15
401447: 49 63 08 movslq (%r8),%rcx
40144a: 49 89 d6 mov %rdx,%r14
40144d: 31 d2 xor %edx,%edx
40144f: 89 fb mov %edi,%ebx
401451: 49 89 f4 mov %rsi,%r12
401454: 4d 89 c5 mov %r8,%r13
401457: 39 c1 cmp %eax,%ecx
401459: 7f 1c jg 401477 <inline_exp2__mul_by_sum+0x57>
40145b: 48 29 c8 sub %rcx,%rax
40145e: 48 8d 14 c5 08 00 00 lea 0x8(,%rax,8),%rdx
401465: 00
401466: 48 b8 00 00 00 00 04 movabs $0x400000000,%rax
40146d: 00 00 00
401470: 48 39 c2 cmp %rax,%rdx
401473: 48 0f 47 d0 cmova %rax,%rdx
401477: 31 f6 xor %esi,%esi
401479: 4c 89 ff mov %r15,%rdi
40147c: e8 af 7e 03 00 callq 439330 <memset>
401481: 85 db test %ebx,%ebx
401483: 0f 8e db 00 00 00 jle 401564 <inline_exp2__mul_by_sum+0x144>
401489: 49 63 06 movslq (%r14),%rax
40148c: 45 8b 55 00 mov 0x0(%r13),%r10d
401490: 45 8b 5d 04 mov 0x4(%r13),%r11d
401494: 45 8b 76 04 mov 0x4(%r14),%r14d
401498: 49 89 c5 mov %rax,%r13
40149b: 48 89 45 c0 mov %rax,-0x40(%rbp)
40149f: 49 63 c2 movslq %r10d,%rax
4014a2: 4c 89 ef mov %r13,%rdi
4014a5: 48 29 c7 sub %rax,%rdi
4014a8: 49 8d 04 ff lea (%r15,%rdi,8),%rax
4014ac: 45 31 ff xor %r15d,%r15d
4014af: 48 89 45 c8 mov %rax,-0x38(%rbp)
4014b3: 4c 89 e8 mov %r13,%rax
4014b6: 48 f7 d8 neg %rax
4014b9: 49 8d 3c c4 lea (%r12,%rax,8),%rdi
4014bd: 4d 63 e6 movslq %r14d,%r12
4014c0: 41 83 c7 01 add $0x1,%r15d
4014c4: 45 39 ee cmp %r13d,%r14d
4014c7: 0f 8c 93 00 00 00 jl 401560 <inline_exp2__mul_by_sum+0x140>
4014cd: 48 8b 45 c0 mov -0x40(%rbp),%rax
4014d1: 4c 8b 4d c8 mov -0x38(%rbp),%r9
4014d5: 31 f6 xor %esi,%esi
4014d7: 4c 8d 40 ff lea -0x1(%rax),%r8
4014db: 48 89 f0 mov %rsi,%rax
4014de: 66 90 xchg %ax,%ax
4014e0: 49 83 c0 01 add $0x1,%r8
4014e4: 45 39 c3 cmp %r8d,%r11d
4014e7: 4a 8b 0c c7 mov (%rdi,%r8,8),%rcx
4014eb: 7c 53 jl 401540 <inline_exp2__mul_by_sum+0x120>
4014ed: 45 39 c2 cmp %r8d,%r10d
4014f0: 7f 4e jg 401540 <inline_exp2__mul_by_sum+0x120>
4014f2: 49 8b 11 mov (%r9),%rdx
4014f5: 48 01 c8 add %rcx,%rax
4014f8: 49 83 c1 08 add $0x8,%r9
4014fc: 48 01 d0 add %rdx,%rax
4014ff: 48 89 d6 mov %rdx,%rsi
401502: 48 21 ca and %rcx,%rdx
401505: 49 89 41 f8 mov %rax,-0x8(%r9)
401509: 48 09 ce or %rcx,%rsi
40150c: 48 f7 d0 not %rax
40150f: 48 21 f0 and %rsi,%rax
401512: 48 09 d0 or %rdx,%rax
401515: 48 c1 e8 3f shr $0x3f,%rax
401519: 4d 39 e0 cmp %r12,%r8
40151c: 75 c2 jne 4014e0 <inline_exp2__mul_by_sum+0xc0>
40151e: 48 89 c6 mov %rax,%rsi
401521: 44 39 fb cmp %r15d,%ebx
401524: 75 9a jne 4014c0 <inline_exp2__mul_by_sum+0xa0>
401526: 48 89 f0 mov %rsi,%rax
401529: 48 83 c4 18 add $0x18,%rsp
40152d: 5b pop %rbx
40152e: 41 5c pop %r12
401530: 41 5d pop %r13
401532: 41 5e pop %r14
401534: 41 5f pop %r15
401536: 5d pop %rbp
401537: c3 retq
401538: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
40153f: 00
401540: 45 39 d5 cmp %r10d,%r13d
401543: 7c 05 jl 40154a <inline_exp2__mul_by_sum+0x12a>
401545: 45 39 de cmp %r11d,%r14d
401548: 7e a8 jle 4014f2 <inline_exp2__mul_by_sum+0xd2>
40154a: be 2f 00 00 00 mov $0x2f,%esi
40154f: bf 80 c4 49 00 mov $0x49c480,%edi
401554: 31 c0 xor %eax,%eax
401556: e8 e1 0e 00 00 callq 40243c <__gnat_rcheck_CE_Index_Check>
40155b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
401560: 31 f6 xor %esi,%esi
401562: eb bd jmp 401521 <inline_exp2__mul_by_sum+0x101>
401564: 31 c0 xor %eax,%eax
401566: 48 83 c4 18 add $0x18,%rsp
40156a: 5b pop %rbx
40156b: 41 5c pop %r12
40156d: 41 5d pop %r13
40156f: 41 5e pop %r14
401571: 41 5f pop %r15
401573: 5d pop %rbp
401574: c3 retq
401575: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40157c: 00 00 00
40157f: 90 nop
In this version the call is gone, and the instructions that can be expected from FZ_Add and W_Carry are included. So inlining is happening and can be checked with objdump.
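Rather than reading the whole listing, the same check can be automated. The following is a rough sketch, with count_calls being a name made up for this example: it reads a disassembly on standard input and counts the call instructions that target a given symbol.

```shell
#!/bin/sh
# count_calls SYMBOL (hypothetical helper): read objdump-style disassembly
# on stdin and print how many call instructions target exactly <SYMBOL>.
count_calls() {
    # Keep only call/callq lines, then count those naming the symbol.
    grep -E 'call[a-z]*[[:space:]]' | grep -cF "<$1>" || true
}
```

For example, ./dump-function.sh inline_exp1_main inline_exp1__mul_by_sum | count_calls fz_arith__fz_add should print 1 for the first listing above; a count of 0 for the copied procedure in the other binaries indicates that the call was inlined.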
In the logs an assertion was made that gnat/gcc would include a copy of an inlined function for reference in the produced binary. This is not so: setting the "-ffunction-sections" flag for the compiler and combining it with "--gc-sections" for the linker removes all sections for symbols that are not referenced.
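Whether an out-of-line copy survived in a binary can be checked against the symbol table. The sketch below makes two assumptions: the build line in the comment passes the flags on the command line through gprbuild's -cargs and -largs switches (the project file could set them instead), and has_symbol is a helper name invented for this example.

```shell
#!/bin/sh
# Assumed build invocation, as an alternative to setting the flags in the
# project file:
#   gprbuild -cargs -ffunction-sections -largs -Wl,--gc-sections
#
# has_symbol NAME (hypothetical helper): read `objdump -t` output on stdin
# and succeed only if a symbol named exactly NAME is listed.
has_symbol() {
    awk -v name="$1" '$NF == name { found = 1 } END { exit !found }'
}
```

Usage would look like objdump -t bin/inline_exp3_main | has_symbol SOME_SYMBOL && echo "still present", where SOME_SYMBOL stands for the linker name of the inlined procedure.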