GNAT Zero Cost Exceptions and Asynchronous Task Aborting. Part 2.

February 17th, 2019

Before going into the details, let's have a look at a modified version of diana_coman's test application. The main module:

with Ada.Text_IO; use Ada.Text_IO;
with Tasks; use Tasks;
with Ada.Task_Identification;
with GNAT.OS_Lib;

procedure Adatests is
  Total: Natural := 0;
begin
  Put_Line("Creating " & Natural'Image(Max_Tasks) & " tasks.");
  Tasks.Create_Tasks(Max_Tasks);

  -- delay 5000.0;
  Put_Line("Will abort main program with C's exit");
  GNAT.OS_Lib.OS_Exit(0);
  Put_Line("MEGA FAIL.");
end Adatests;

We spawn Max_Tasks tasks; while they are being spawned, the race condition should manifest. If it does not, we just exit, and the operator has to retry. We don't abort tasks, for reasons that will become apparent soon. The commented-out delay line is to be uncommented for gdb'ing, so that the application does not 'evaporate' during the debugging session.

Tasks.ads: Max_Tasks is set to 100 to make the race condition more probable; procedures other than Create_Tasks are gone.

package Tasks is
  Max_Tasks: constant Natural := 100;
  task type TestTask( ID: Natural );
  type Task_Addr is access TestTask;
  subtype Task_Count is Natural range 0..Max_Tasks;
  type TA is array(1..Task_Count'Last) of Task_Addr;

  procedure Create_Tasks(N: in Task_Count);
private
  A: TA;
end Tasks;

alltests.gpr. Static linking is a must: the problem does not appear under dynamic linking. Adding -Wl,--eh-frame-hdr to the linker command line makes the race condition much less probable, but does not eliminate it.

project AllTests is
  for Languages use ("Ada");

  for Source_Dirs use ("tests");
  for Object_Dir use "obj";

  for Ignore_Source_Sub_Dirs use (".svn", ".git", "@*");

  for Exec_Dir use "."; -- create exec here at top level

  for Main use ("adatests.adb");
  package Compiler is
     for Switches ("Ada")
       use ("-Og", "-ggdb3");
  end Compiler;
  package Linker is
    -- for Switches ("Ada") use ("-static", "-static-libgcc", "-Wl,--eh-frame-hdr");
    for Switches ("Ada") use ("-static", "-static-libgcc");
  end Linker;
end AllTests;

Finally, tasks.adb. The tasks spawned in Create_Tasks exit immediately by raising an exception. This is a simplification compared to the application from the previous post: there, the polling instrumentation from 'pragma Polling(ON)' and the aborting procedure were working correctly. In the GNAT runtime, aborting a thread happens by raising a Standard'Abort_Signal exception, and the race condition is in the code that raises an exception. So we don't use the 'Abort_Signal' exception at all: Constraint_Error works for us just the same, and it also shows that "don't abort threads, then" is not a solution to the problem, because the same problem appears whenever two threads happen to raise exceptions concurrently.

with Ada.Text_IO; use Ada.Text_IO;

package body Tasks is
  task body TestTask is
  begin
    raise Constraint_Error;
  end TestTask;

  procedure Create_Tasks(N: in Task_Count) is
  begin
    for I in 1..N loop
      A(I) := new TestTask(I);
    end loop;
  end Create_Tasks;

end Tasks;

So, after spending some time debugging the issue, I have found the culprit: a bad design inside GCC, which breaks statically linked multithreaded applications that use exceptions. To raise an exception in Ada, libgnarl allocates an exception data structure, fills it in, and passes it to _Unwind_RaiseException, which is a libgcc function. The further discussion has nothing to do with GNAT or Ada [1]; also, from now on abort means C's abort(3), not Ada's task abort.

_Unwind_Reason_Code
__gnat_Unwind_RaiseException (_Unwind_Exception *e)
{
#ifdef __USING_SJLJ_EXCEPTIONS__
  return _Unwind_SjLj_RaiseException (e);
#else
  return _Unwind_RaiseException (e);
#endif
}

Inside libgcc, the stack is unwound to find the exception handler; for unwinding to work, it needs to consult ELF tables with information on how to unwind the stack at each Program Counter value [2]. This unwinding must happen with locks taken (search for locks), otherwise threads may corrupt each other's state [3]. While I cannot exclude silent state corruption in some cases, and I have seen deadlocks as well, these are much rarer than crashes, so the further discussion concerns only the most probable failure mode.

// From unwind-dw2-fde.c:
const fde *
_Unwind_Find_FDE (void *pc, struct dwarf_eh_bases *bases)
{
  struct object *ob;
  const fde *f = NULL;

  init_object_mutex_once ();
  __gthread_mutex_lock (&object_mutex);

  /* Linear search through the classified objects, to find the one
     containing the pc.  Note that pc_begin is sorted descending, and
     we expect objects to be non-overlapping.  */
  // ...

  /* Classify and search the objects we've not yet processed.  */
  // ...

 fini:
  __gthread_mutex_unlock (&object_mutex);

  if (f)
    {
      // ...
    }

  return f;
}

There is a sanity check (gcc_assert (code == _URC_NO_REASON);) after the lookup of an FDE (frame description entry). This check is what typically fails, and the failure ends in an abort(3) invocation. Depending on the musl version used, this can appear as a segfault (trying to execute the ud2 instruction on Intel), or as a well-behaved abort on later musl versions.

// From unwind-dw2.c
static void __attribute__((noinline))
uw_init_context_1 (struct _Unwind_Context *context,
                   void *outer_cfa, void *outer_ra)
{
  void *ra = __builtin_extract_return_addr (__builtin_return_address (0));
  _Unwind_FrameState fs;
  _Unwind_SpTmp sp_slot;
  _Unwind_Reason_Code code;

  memset (context, 0, sizeof (struct _Unwind_Context));
  context->ra = ra;
  if (!ASSUME_EXTENDED_UNWIND_CONTEXT)
    context->flags = EXTENDED_CONTEXT_BIT;

  code = uw_frame_state_for (context, &fs);
  gcc_assert (code == _URC_NO_REASON);
  //...

Why is there a race condition even if there are locks in place? Well, GCC devs have applied an 'optimization' to their locking infrastructure (see __gthread_active_p), disabling it for single-threaded applications. How do they detect if the application is single-threaded? Well, they check if a certain function is linked into the application, by using the 'weak symbols' mechanism of the linker [4]. But with static linking the presence of one symbol does not signal the presence of any other symbol. So with statically linked musl the problem can manifest in different ways: the locking is disabled and dead-code-eliminated although the locking code is in place (the multithread-presence-signalling function is not linked in, while the others are), or the locking is enabled, but with calls to a null address instead of calls to pthread_mutex_lock, and so on.

// From gthr-posix.h
#if SUPPORTS_WEAK && GTHREAD_USE_WEAK
# ifndef __gthrw_pragma
#  define __gthrw_pragma(pragma)
# endif
# define __gthrw2(name,name2,type) \
  static __typeof(type) name \
    __attribute__ ( (__weakref__(#name2), __copy__ (type))); \
  __gthrw_pragma(weak type)
# define __gthrw_(name) __gthrw_ ## name
#else
# define __gthrw2(name,name2,type)
# define __gthrw_(name) name
#endif
 //...
#ifdef __GLIBC__
__gthrw2(__gthrw_(__pthread_key_create),
	 __pthread_key_create,
	 pthread_key_create)
# define GTHR_ACTIVE_PROXY	__gthrw_(__pthread_key_create)
#elif defined (__BIONIC__)
# define GTHR_ACTIVE_PROXY	__gthrw_(pthread_create)
#else
# define GTHR_ACTIVE_PROXY	__gthrw_(pthread_cancel)
#endif

static inline int
__gthread_active_p (void)
{
  static void *const __gthread_active_ptr
    = __extension__ (void *) &GTHR_ACTIVE_PROXY;
  return __gthread_active_ptr != 0;
}
//...
static inline int
__gthread_mutex_lock (__gthread_mutex_t *__mutex)
{
  if (__gthread_active_p ())
    return __gthrw_(pthread_mutex_lock) (__mutex);
  else
    return 0;
}

The solution is of course to disable the weak symbol mechanism; however, I have failed to do this correctly so far. The musl people managed to solve it for C++ and Fortran somehow. The patch that did the trick for other people back in the day does not work for me: with it, linking with libgcc causes 'unresolved reference' errors over a subset of the symbols that were previously declared as weak, even with all references to the word 'weak' eliminated from the source. This patch can work with ave1's gnat, perhaps.

In the meantime I also tried to get ave1gnat running: it turned out that I can't bootstrap it on any machine that I have. Ave1gnat requires Adacore GNAT-2016 for bootstrap. However, this version of GNAT relies on the host system to provide the libc CRT files (crtn.o, crt1.o, possibly others). On the other hand, around gcc5 times (early 2016) binutils were verschlimmbessert with support for new relocations, which found their way into the CRT files. So adacore-gnat, which ships with old binutils, barfs (unrecognized relocation) when linking the CRT into the binary. And bootstrapping with GNAT-2017 does not work, because it does not like the style and the layout of some structure in the GNAT-2016 source.

One approach to this problem would be for me to downgrade binutils and system libraries, or perhaps to apply the patches (one, two) and bootstrap with gnat-2017. But the problem remains, because on Cuntoo the bootstrapper won't work at all: Adacore's distribution is glibc-based, while Cuntoo is musl-based, so an alternative bootstrapping compiler is required -- Adacore's gnat just does not run there. IMO, all this really calls for a binary distribution of a prebuilt ave1gnat for bootstrapping purposes.

  1. Suggesting that the same will be true for a fix, when it is found.
  2. The terminology may be slightly off here; I'm definitely not an expert on exception handling inside GCC.
  3. The file linked is for the ZCX runtime, but the same problem can appear with SJLJ as well.
    By the way, why is there no support for asynchronous cancellation with ZCX? I tried to look at the historical GCC code; it seems that this restriction has been in place since at least 2001, but at no point did I manage to find any rationale for this decision, no explanation of what breakage to expect, nothing. Commit messages and Changelog entries from that time are not helpful at all. As I don't have enough fundamental knowledge to answer this question, it seems that the only way to get the information is to ask someone from Adacore.
  4. A symbol is initialized to a default value unless there is an external symbol with that name. If there is an external symbol, it overwrites the weak symbol instead of causing a symbol clash error.