GNAT and Pragma Inline_Always: A report

05/02/18, modified 17/09/18

Will the GNAT Ada compiler inline a function when Pragma Inline_Always is used? And is the dump of a function a red herring? In short: yes it will, and no it is not. For the pragma to take effect on calls from other modules, Pragma Inline_Always needs to be put in the specification files ('*.ads') and not in the implementation files ('*.adb'). The dump of a function in a binary accurately shows whether inlining was performed.

To run the code, use the following v-patch on the code in FFA chapter 11;

We start with the preliminaries. The GNAT version, as reported by gprbuild --version:

GPRBUILD GPL 2016 (20160515) (x86_64-pc-linux-gnu)
Copyright (C) 2004-2016, AdaCore
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

In FFA almost all functions carry the pragma Inline_Always adornment. According to the gcc 4.9.4 documentation, this pragma is: "Similar to pragma Inline except that inlining is not subject to the use of option -gnatn or -gnatN and the inlining happens regardless of whether these options are used."
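Because the location of the pragma turns out to matter, it can be handy to list which body files carry it. The following is only a sketch: the function name bodies_with_inline_always is made up for this illustration, and it merely reports where the text occurs. It does not tell whether gnat honoured the pragma.

```shell
#!/bin/sh
# Sketch: print every body file (*.adb) below directory $1 that contains
# the text "pragma Inline_Always" (case-insensitive, as Ada is).
# bodies_with_inline_always is a hypothetical helper name, not part of FFA.
bodies_with_inline_always() {
    find "$1" -type f -name '*.adb' \
        -exec grep -q -i 'pragma[[:space:]]*Inline_Always' {} \; -print
}
```

After pasting the function into a shell, calling it with the directory that holds the sources (for example bodies_with_inline_always . from the top of the tree) lists the bodies to look at.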

All of the code is based on a copy of the ffademo files provided in the FFA chapter 1 genesis.

I base the tests on the FZ_Add procedure and use it to implement the slowest possible multiplication function. This multiplication function is specified to multiply an FZ by a natural number. To multiply, we add the input FZ to a tally as many times as given by the natural number.

   procedure Mul_By_Sum
     (N        : in     Natural;
      X        : in     FZ;
      Z        :    out FZ;
      Overflow :    out WBool)
   is

      C : WBool := 0;
   begin
      Z := (others => 0);
      for i in 1 .. N loop
         FZ_Add (X, Z, Z, C);
      end loop;
      Overflow := C;
   end Mul_By_Sum;

   procedure Inline_Experiment is
      X : FZ (1 .. 256) := (others => 0);
      Z : FZ (1 .. 256) := (others => 0);

      -- Carry.
      C : WBool := 0;

      N : Indices := 256;
   begin
      for i in 1 .. N loop
         X (i) := Word (i);
      end loop;

      Mul_By_Sum (10000000, X, Z, C);

      Dump (Z);

   end Inline_Experiment;

The experiment is divided over 3 programs, each with its own set of files:

  • inline_exp1_main: the FZ_Add and W_Carry subprograms from libffa are used as is.
  • inline_exp2_main: the FZ_Add and W_Carry subprograms are copied and included in the body of the module. For clarity we add a '2' to their names.
  • inline_exp3_main: the FZ_Add and W_Carry subprograms are copied to a separate module and pragma Inline_Always is included in the specification file. For clarity we add a '3' to their names.

Read the code in ffainline (except for mul_by_sum, these are all either calls to libffa or copies of the code from libffa), then build and run everything with:

cd ffa/ffainline

gprbuild

time ./bin/inline_exp1_main > exp1.txt

time ./bin/inline_exp2_main > exp2.txt

time ./bin/inline_exp3_main > exp3.txt

diff exp0.txt exp1.txt

diff exp0.txt exp2.txt

diff exp0.txt exp3.txt

The diffs should be empty, and the timings should differ. A table with my timings:



Experiment   Time (seconds)
1            8.579
2            4.924
3            4.871
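The relative savings follow directly from these numbers. A small sketch, with awk doing the floating point arithmetic (the function name saving is made up for this illustration):

```shell
#!/bin/sh
# Timings in seconds, copied from the table above.
T1=8.579
T2=4.924
T3=4.871

# Percentage of run time saved by timing $2 relative to the baseline $1.
# saving is a hypothetical helper name.
saving() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.1f\n", (base - t) / base * 100 }'
}

SAVING2=`saving $T1 $T2`
SAVING3=`saving $T1 $T3`
echo "experiment 2 takes $SAVING2 percent less time than experiment 1"
echo "experiment 3 takes $SAVING3 percent less time than experiment 1"
```

This reports 42.6 percent for experiment 2 and 43.2 percent for experiment 3.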

The second and third versions take a little more than 40% less time than the first version. The only difference between the programs is where the procedures are defined, so the timing difference is all due to inlining. Let's check by disassembling the code with objdump. As objdump disassembles everything, we need a small script to return just the code for one function:

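#!/bin/sh
# Usage: dump-function.sh BINARY SYMBOL
# Disassembles one function: from its own address up to the next symbol.

# Symbol table, sorted by address and numbered: "  42 @ <address> ... <name>"
SYMBOLS=`objdump -t "$1" | sort | nl -s ' @ '`
# First entry that mentions the requested symbol.
FUNCTIONLINE=`echo "$SYMBOLS" | grep "$2" | head -n 1`
LINENR=`echo "$FUNCTIONLINE" | awk '{print $1}'`
NEXTLINENR=`expr "$LINENR" + 1`
# The entry that follows in address order marks where the function stops.
NEXTLINE=`echo "$SYMBOLS" | awk -v n="$NEXTLINENR" '$1 == n'`
START=`echo "$FUNCTIONLINE" | awk '{print $3}'`
STOP=`echo "$NEXTLINE" | awk '{print $3}'`
objdump --start-address=0x$START --stop-address=0x$STOP -S -d "$1"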

The dump for the first version of mul_by_sum [2]:

./dump-function.sh inline_exp1_main inline_exp1__mul_by_sum


inline_exp1_main:     file format elf64-x86-64

Disassembly of section .text:

0000000000401430 <inline_exp1__mul_by_sum>:
  401430:       55                      push   %rbp
  401431:       48 89 e5                mov    %rsp,%rbp
  401434:       41 57                   push   %r15
  401436:       41 56                   push   %r14
  401438:       41 55                   push   %r13
  40143a:       41 54                   push   %r12
  40143c:       53                      push   %rbx
  40143d:       48 81 ec 38 10 00 00    sub    $0x1038,%rsp
  401444:       48 83 0c 24 00          orq    $0x0,(%rsp)
  401449:       48 81 c4 20 10 00 00    add    $0x1020,%rsp
  401450:       49 63 40 04             movslq 0x4(%r8),%rax
  401454:       49 89 cc                mov    %rcx,%r12
  401457:       49 63 08                movslq (%r8),%rcx
  40145a:       49 89 d7                mov    %rdx,%r15
  40145d:       31 d2                   xor    %edx,%edx
  40145f:       41 89 fd                mov    %edi,%r13d
  401462:       48 89 75 c8             mov    %rsi,-0x38(%rbp)
  401466:       4c 89 c3                mov    %r8,%rbx
  401469:       39 c1                   cmp    %eax,%ecx
  40146b:       7f 1c                   jg     401489 <inline_exp1__mul_by_sum+0x59>
  40146d:       48 29 c8                sub    %rcx,%rax
  401470:       48 8d 14 c5 08 00 00    lea    0x8(,%rax,8),%rdx
  401477:       00
  401478:       48 b8 00 00 00 00 04    movabs $0x400000000,%rax
  40147f:       00 00 00
  401482:       48 39 c2                cmp    %rax,%rdx
  401485:       48 0f 47 d0             cmova  %rax,%rdx
  401489:       31 f6                   xor    %esi,%esi
  40148b:       4c 89 e7                mov    %r12,%rdi
  40148e:       e8 bd 80 03 00          callq  439550 <memset>
  401493:       45 85 ed                test   %r13d,%r13d
  401496:       7e 38                   jle    4014d0 <inline_exp1__mul_by_sum+0xa0>
  401498:       45 31 f6                xor    %r14d,%r14d
  40149b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  4014a0:       48 8b 7d c8             mov    -0x38(%rbp),%rdi
  4014a4:       41 83 c6 01             add    $0x1,%r14d
  4014a8:       4d 89 e0                mov    %r12,%r8
  4014ab:       49 89 d9                mov    %rbx,%r9
  4014ae:       4c 89 e2                mov    %r12,%rdx
  4014b1:       48 89 d9                mov    %rbx,%rcx
  4014b4:       4c 89 fe                mov    %r15,%rsi
  4014b7:       e8 c4 00 00 00          callq  401580 <fz_arith__fz_add>
  4014bc:       45 39 f5                cmp    %r14d,%r13d
  4014bf:       75 df                   jne    4014a0 <inline_exp1__mul_by_sum+0x70>
  4014c1:       48 83 c4 18             add    $0x18,%rsp
  4014c5:       5b                      pop    %rbx
  4014c6:       41 5c                   pop    %r12
  4014c8:       41 5d                   pop    %r13
  4014ca:       41 5e                   pop    %r14
  4014cc:       41 5f                   pop    %r15
  4014ce:       5d                      pop    %rbp
  4014cf:       c3                      retq
  4014d0:       31 c0                   xor    %eax,%eax
  4014d2:       48 83 c4 18             add    $0x18,%rsp
  4014d6:       5b                      pop    %rbx
  4014d7:       41 5c                   pop    %r12
  4014d9:       41 5d                   pop    %r13
  4014db:       41 5e                   pop    %r14
  4014dd:       41 5f                   pop    %r15
  4014df:       5d                      pop    %rbp
  4014e0:       c3                      retq
  4014e1:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  4014e8:       00 00 00
  4014eb:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

Can you see the callq 401580 <fz_arith__fz_add> in there? Clearly FZ_Add was not inlined. Now for the second version:

./dump-function.sh inline_exp2_main inline_exp2__mul_by_sum


inline_exp2_main:     file format elf64-x86-64

Disassembly of section .text:

0000000000401420 <inline_exp2__mul_by_sum>:
  401420:       55                      push   %rbp
  401421:       48 89 e5                mov    %rsp,%rbp
  401424:       41 57                   push   %r15
  401426:       41 56                   push   %r14
  401428:       41 55                   push   %r13
  40142a:       41 54                   push   %r12
  40142c:       53                      push   %rbx
  40142d:       48 81 ec 38 10 00 00    sub    $0x1038,%rsp
  401434:       48 83 0c 24 00          orq    $0x0,(%rsp)
  401439:       48 81 c4 20 10 00 00    add    $0x1020,%rsp
  401440:       49 63 40 04             movslq 0x4(%r8),%rax
  401444:       49 89 cf                mov    %rcx,%r15
  401447:       49 63 08                movslq (%r8),%rcx
  40144a:       49 89 d6                mov    %rdx,%r14
  40144d:       31 d2                   xor    %edx,%edx
  40144f:       89 fb                   mov    %edi,%ebx
  401451:       49 89 f4                mov    %rsi,%r12
  401454:       4d 89 c5                mov    %r8,%r13
  401457:       39 c1                   cmp    %eax,%ecx
  401459:       7f 1c                   jg     401477 <inline_exp2__mul_by_sum+0x57>
  40145b:       48 29 c8                sub    %rcx,%rax
  40145e:       48 8d 14 c5 08 00 00    lea    0x8(,%rax,8),%rdx
  401465:       00
  401466:       48 b8 00 00 00 00 04    movabs $0x400000000,%rax
  40146d:       00 00 00
  401470:       48 39 c2                cmp    %rax,%rdx
  401473:       48 0f 47 d0             cmova  %rax,%rdx
  401477:       31 f6                   xor    %esi,%esi
  401479:       4c 89 ff                mov    %r15,%rdi
  40147c:       e8 af 7e 03 00          callq  439330 <memset>
  401481:       85 db                   test   %ebx,%ebx
  401483:       0f 8e db 00 00 00       jle    401564 <inline_exp2__mul_by_sum+0x144>
  401489:       49 63 06                movslq (%r14),%rax
  40148c:       45 8b 55 00             mov    0x0(%r13),%r10d
  401490:       45 8b 5d 04             mov    0x4(%r13),%r11d
  401494:       45 8b 76 04             mov    0x4(%r14),%r14d
  401498:       49 89 c5                mov    %rax,%r13
  40149b:       48 89 45 c0             mov    %rax,-0x40(%rbp)
  40149f:       49 63 c2                movslq %r10d,%rax
  4014a2:       4c 89 ef                mov    %r13,%rdi
  4014a5:       48 29 c7                sub    %rax,%rdi
  4014a8:       49 8d 04 ff             lea    (%r15,%rdi,8),%rax
  4014ac:       45 31 ff                xor    %r15d,%r15d
  4014af:       48 89 45 c8             mov    %rax,-0x38(%rbp)
  4014b3:       4c 89 e8                mov    %r13,%rax
  4014b6:       48 f7 d8                neg    %rax
  4014b9:       49 8d 3c c4             lea    (%r12,%rax,8),%rdi
  4014bd:       4d 63 e6                movslq %r14d,%r12
  4014c0:       41 83 c7 01             add    $0x1,%r15d
  4014c4:       45 39 ee                cmp    %r13d,%r14d
  4014c7:       0f 8c 93 00 00 00       jl     401560 <inline_exp2__mul_by_sum+0x140>
  4014cd:       48 8b 45 c0             mov    -0x40(%rbp),%rax
  4014d1:       4c 8b 4d c8             mov    -0x38(%rbp),%r9
  4014d5:       31 f6                   xor    %esi,%esi
  4014d7:       4c 8d 40 ff             lea    -0x1(%rax),%r8
  4014db:       48 89 f0                mov    %rsi,%rax
  4014de:       66 90                   xchg   %ax,%ax
  4014e0:       49 83 c0 01             add    $0x1,%r8
  4014e4:       45 39 c3                cmp    %r8d,%r11d
  4014e7:       4a 8b 0c c7             mov    (%rdi,%r8,8),%rcx
  4014eb:       7c 53                   jl     401540 <inline_exp2__mul_by_sum+0x120>
  4014ed:       45 39 c2                cmp    %r8d,%r10d
  4014f0:       7f 4e                   jg     401540 <inline_exp2__mul_by_sum+0x120>
  4014f2:       49 8b 11                mov    (%r9),%rdx
  4014f5:       48 01 c8                add    %rcx,%rax
  4014f8:       49 83 c1 08             add    $0x8,%r9
  4014fc:       48 01 d0                add    %rdx,%rax
  4014ff:       48 89 d6                mov    %rdx,%rsi
  401502:       48 21 ca                and    %rcx,%rdx
  401505:       49 89 41 f8             mov    %rax,-0x8(%r9)
  401509:       48 09 ce                or     %rcx,%rsi
  40150c:       48 f7 d0                not    %rax
  40150f:       48 21 f0                and    %rsi,%rax
  401512:       48 09 d0                or     %rdx,%rax
  401515:       48 c1 e8 3f             shr    $0x3f,%rax
  401519:       4d 39 e0                cmp    %r12,%r8
  40151c:       75 c2                   jne    4014e0 <inline_exp2__mul_by_sum+0xc0>
  40151e:       48 89 c6                mov    %rax,%rsi
  401521:       44 39 fb                cmp    %r15d,%ebx
  401524:       75 9a                   jne    4014c0 <inline_exp2__mul_by_sum+0xa0>
  401526:       48 89 f0                mov    %rsi,%rax
  401529:       48 83 c4 18             add    $0x18,%rsp
  40152d:       5b                      pop    %rbx
  40152e:       41 5c                   pop    %r12
  401530:       41 5d                   pop    %r13
  401532:       41 5e                   pop    %r14
  401534:       41 5f                   pop    %r15
  401536:       5d                      pop    %rbp
  401537:       c3                      retq
  401538:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  40153f:       00
  401540:       45 39 d5                cmp    %r10d,%r13d
  401543:       7c 05                   jl     40154a <inline_exp2__mul_by_sum+0x12a>
  401545:       45 39 de                cmp    %r11d,%r14d
  401548:       7e a8                   jle    4014f2 <inline_exp2__mul_by_sum+0xd2>
  40154a:       be 2f 00 00 00          mov    $0x2f,%esi
  40154f:       bf 80 c4 49 00          mov    $0x49c480,%edi
  401554:       31 c0                   xor    %eax,%eax
  401556:       e8 e1 0e 00 00          callq  40243c <__gnat_rcheck_CE_Index_Check>
  40155b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  401560:       31 f6                   xor    %esi,%esi
  401562:       eb bd                   jmp    401521 <inline_exp2__mul_by_sum+0x101>
  401564:       31 c0                   xor    %eax,%eax
  401566:       48 83 c4 18             add    $0x18,%rsp
  40156a:       5b                      pop    %rbx
  40156b:       41 5c                   pop    %r12
  40156d:       41 5d                   pop    %r13
  40156f:       41 5e                   pop    %r14
  401571:       41 5f                   pop    %r15
  401573:       5d                      pop    %rbp
  401574:       c3                      retq
  401575:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  40157c:       00 00 00
  40157f:       90                      nop

In this version the call is gone, and the instructions that can be expected from FZ_Add and W_Carry are included. So inlining is happening and can be checked with objdump.
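The same check can be made without reading the whole listing, by reducing a dump to the names of the functions it still calls. A minimal sketch, assuming the objdump output format shown above (the helper name call_targets is made up for this illustration):

```shell
#!/bin/sh
# Sketch: reduce a disassembly read from stdin to its distinct call targets.
# call_targets is a hypothetical helper name; it assumes objdump's x86-64
# AT&T output, where a call looks like "callq  401580 <fz_arith__fz_add>".
call_targets() {
    grep -E '[[:space:]]callq?[[:space:]]' |
        sed -n 's/.*<\([^>+]*\)>.*/\1/p' |
        sort -u
}

# Demonstration on three lines taken from the first dump above:
# two calls and one jump, which must be ignored.
TARGETS=$(printf '%s\n' \
    '  40148e:       e8 bd 80 03 00          callq  439550 <memset>' \
    '  4014b7:       e8 c4 00 00 00          callq  401580 <fz_arith__fz_add>' \
    '  4014bf:       75 df                   jne    4014a0 <inline_exp1__mul_by_sum+0x70>' |
    call_targets)
echo "$TARGETS"
```

The demonstration prints fz_arith__fz_add and memset. On the real binaries the function can sit at the end of a pipe, as in ./dump-function.sh inline_exp2_main inline_exp2__mul_by_sum | call_targets. Going by the listing above, the second version should then show only __gnat_rcheck_CE_Index_Check and memset.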

In the logs an assertion was made that gnat/gcc would keep a copy of an inlined function in the produced binary for reference. This is not so: setting the "-ffunction-sections" flag for the compiler and combining it with "--gc-sections" for the linker will remove all sections for symbols that are not referenced [3].
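Whether a symbol survived the link can be checked with nm. A minimal sketch: has_symbol is a made-up helper name, and the gprbuild line in the comment is only an assumption about how the switches could be passed by hand. The project file in the patch may well set them itself.

```shell
#!/bin/sh
# Possible way to pass the switches by hand (an assumption, not taken from
# the patch):
#   gprbuild -cargs -ffunction-sections -largs -Wl,--gc-sections

# Sketch: succeed when the nm listing on stdin contains a symbol whose name
# includes the text in $1. has_symbol is a hypothetical helper name.
has_symbol() {
    awk -v pat="$1" 'index($NF, pat) { found = 1 } END { exit !found }'
}

# Demonstration on a hand-written, nm-style line.
if printf '%s\n' '0000000000401580 T fz_arith__fz_add' | has_symbol fz_add
then
    echo "fz_add is present"
else
    echo "fz_add is gone"
fi
```

On the real binaries this becomes nm bin/inline_exp2_main | has_symbol fz_add, which should fail for experiments 2 and 3 if the unreferenced sections were indeed dropped.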

  1. This is a demonstration patch, purely to investigate the inline options, not for production purposes.
  2. An Ada symbol name can be easily guessed: the module name in lower case, two underscores, then the procedure / function name in lower case.
  3. Easy to check with nm on the binaries: there is no fz_add in experiment 2 or 3.

5 Responses to “GNAT and Pragma Inline_Always: A report”

  1. How then to explain the massive timing variance when my "incorrect" inline pragmas are selectively removed?

  2. FTR I see exactly the same effect on the dump when toggling the pragmas in their current (i.e. in the .adb bodies) locations. (This is on gnat 2016).

    • ave1 says:

      Which calls are you looking at? For methods in the same module the inlining pragma works. For methods outside of the module it seems not to work.

  3. [...] ave1 carefully analyzed the executables generated by AdaCore’s x86-64 GNAT, and showed that this compiler is prone to ignore Inline pragmas unless they are specified in a particular [...]

