Archive for the ‘Linux’ Category

A tiny and incomplete TRB wedgetrace

Tuesday, February 25th, 2020

In the end of this weekend, I have discovered that my node is also in the list of wedged ones. However, the node restart did not help, and trb would reproducibly get stuck after a few hours. This warranted some debugging, so here is what I have go as a result meanwhile:

The observable deviation of node in wedge is that it spends *a lot* of time in SHA256 code1 :

#0 at some address in SHA256()
#1  0x0000000000455aaa in Hash<__gnu_cxx::__normal_iterator<char*, std::vector<char, zero_after_free_allocator<char> > > > (pend=..., pbegin=...) at util.h:496
#2  CNode::EndMessage (this=0xe509020) at net.h:322
#3  0x0000000000447d46 in CNode::PushMessage (a1=..., pszCommand=0x7aa8b6 "block", this=0xe509020) at net.h:386
#4  ProcessMessage (pfrom=pfrom@entry=0xe509020, strCommand=..., vRecv=...) at main.cpp:1877
#5  0x000000000044a9e1 in ProcessMessages (pfrom=pfrom@entry=0xe509020) at main.cpp:2207
#6  0x000000000045d5c3 in ThreadMessageHandler2 (parg=parg@entry=0x0) at net.cpp:1129
#7  0x000000000045d7b7 in ThreadMessageHandler (parg=0x0) at net.cpp:1094

Poking it with gdb has shown that the amount of data to hash is around 4Gb:

(gdb) up
#1  0x0000000000455aaa in Hash<__gnu_cxx::__normal_iterator > > > (pend=..., pbegin=...) at util.h:496
(gdb) print pbegin
$13 = {_M_current = 0x7f416d082709 "k\230\242\241\246K٬\237믔*\220\320G\377\377\377\377\002\070\320\035"}
(gdb) print pend
$14 = {_M_current = 0x7f426d0938cb ""}
(gdb) print 0x7f426d0938cb - 0x7f416d082709
$15 = 4295037378
(gdb) print/x 0x7f426d0938cb - 0x7f416d082709
$16 = 0x1000111c2

I was surprised that the application did not segfault, but no, all this memory is valid.

Meanwhile, do we have anything suspicious in the logs? I will not show the whole ~36000 continuous suspicious lines, it looked like a very long stream of:

received getdata for: block 0000000000000021f773
received getdata for: block 000000000000013cdf32
received getdata for: block 00000000000001614121
received getdata for: block 0000000000000152a8d3
received getdata for: block 000000000000010f9e85
received getdata for: block 00000000000000630879
received getdata for: block 000000000000007a52db
received getdata for: block 00000000000000167398
received getdata for: block 0000000000000045e696
received getdata for: block 000000000000000e97b9
received getdata for: block 00000000000000dcb9b1
received getdata for: block 00000000000000d50c5a
received getdata for: block 000000000000005346f8
received getdata for: block 00000000000001a18908
received getdata for: block 0000000000000115bca1
received getdata for: block 000000000000015bf999
received getdata for: block 00000000000001b4ba62
received getdata for: block 00000000000000095418
received getdata for: block 000000000000008e3f2f
received getdata for: block 00000000000001a96ea8
received getdata for: block 00000000000000a5c79b
received getdata for: block 00000000000000e03c41
received getdata for: block 0000000000000094b6a0
received getdata for: block 00000000000000fa78a1
received getdata for: block 00000000000001d2b140

If I go one level higher, the invocation of Hash happens in EndMessage:

    void EndMessage()
    {
        if (mapArgs.count("-dropmessagestest") && GetRand(atoi(mapArgs["-dropmessagestest"])) == 0)
        {
            printf("dropmessages DROPPING SEND MESSAGE\n");
            AbortMessage();
            return;
        }

        if (nHeaderStart == -1)
            return;

        // Set the size
        unsigned int nSize = vSend.size() - nMessageStart;
        memcpy((char*)&vSend[nHeaderStart] + offsetof(CMessageHeader, nMessageSize), &nSize, sizeof(nSize));

        // Set the checksum
        if (vSend.GetVersion() >= 209)
        {
            uint256 hash = Hash(vSend.begin() + nMessageStart, vSend.end());
            unsigned int nChecksum = 0;
            memcpy(&nChecksum, &hash, sizeof(nChecksum));
            assert(nMessageStart - nHeaderStart >= offsetof(CMessageHeader, nChecksum) + sizeof(nChecksum));
            memcpy((char*)&vSend[nHeaderStart] + offsetof(CMessageHeader, nChecksum), &nChecksum, sizeof(nChecksum));
        }

        if (fDebug) {
            printf("(%d bytes)\n", nSize);
        }

        nHeaderStart = -1;
        nMessageStart = -1;
        LEAVE_CRITICAL_SECTION(cs_vSend);
    }

One thing to test is if the span of (vSend.begin() + nMessageStart, vSend.end()) passed to Hash should be equal to nSize? Small changes to util.h:

inline uint256 Hash(const T1 pbegin, const T1 pend)
{
    static unsigned char pblank[1];
    uint256 hash1;
    if (pbegin > pend) {
      printf("%p > %p!\n", pbegin, pend);
      for(;;) {
        sleep(10); // spin
      }
    }
    if ((pend - pbegin) * sizeof(pbegin[0]) >= 0x10000) {
      printf("Size large: %zu (%p, %p)\n", (pend - pbegin) * sizeof(pbegin[0]), pbegin, pend);
    }
    if ((pend - pbegin) * sizeof(pbegin[0]) >= 0x100000000) {
      printf("Size wedge: %zu (%p, %p)\n", (pend - pbegin) * sizeof(pbegin[0]), pbegin, pend);
      for(;;) {
        sleep(10); // spin
      }
    }
    SHA256((pbegin == pend ? pblank : (unsigned char*)&pbegin[0]), (pend - pbegin) * sizeof(pbegin[0]), (unsigned char*)&hash1);
    uint256 hash2;
    SHA256((unsigned char*)&hash1, sizeof(hash1), (unsigned char*)&hash2);
    return hash2;
}

Rebuild, launch with -debug, and indeed, in a few hours TRB gets stuck, and the span passed to Hash must be equal to nSize, from the logs:

received getdata for: block 0000000000000069f43c
02/24/20 15:37:47 sending: block Size large: 245265 (0x7f426cefdf51, 0x7f426cf39d62)
(245265 bytes)
received getdata for: block 00000000000001cfe272
02/24/20 15:37:47 sending: block (3985 bytes)
received getdata for: block 00000000000001884a04
02/24/20 15:37:47 sending: block Size large: 106902 (0x7f426cf3ad23, 0x7f426cf54eb9)
(106902 bytes)
received getdata for: block 000000000000010bc490
02/24/20 15:37:47 sending: block (1366 bytes)
received getdata for: block 000000000000002eac48
02/24/20 15:37:47 sending: block Size large: 191408 (0x7f426cf5543f, 0x7f426cf83fef)
(191408 bytes)
received getdata for: block 00000000000000e95188
02/24/20 15:37:47 sending: block Size large: 144180 (0x7f426cf84007, 0x7f426cfa733b)
(144180 bytes)
received getdata for: block 00000000000000abe2b0
02/24/20 15:37:47 sending: block Size large: 97463 (0x7f426cfa7353, 0x7f426cfbf00a)
(97463 bytes)
received getdata for: block 00000000000001cce3fd
02/24/20 15:37:47 sending: block (1201 bytes)
received getdata for: block 000000000000002bbc8c
02/24/20 15:37:47 sending: block Size large: 120323 (0x7f426cfbf4eb, 0x7f426cfdcaee)
(120323 bytes)
received getdata for: block 00000000000001cb9e0f
02/24/20 15:37:47 sending: block (54066 bytes)
received getdata for: block 00000000000000fcab37
02/24/20 15:37:47 sending: block Size large: 230268 (0x7f426cfe9e50, 0x7f426d0221cc)
(230268 bytes)
received getdata for: block 0000000000000018f8fe
02/24/20 15:37:47 sending: block (42035 bytes)
received getdata for: block 00000000000001292f26
02/24/20 15:37:47 sending: block Size large: 103316 (0x7f426d02c62f, 0x7f426d0459c3)
(103316 bytes)
received getdata for: block 00000000000001a74535
02/24/20 15:37:47 sending: block Size large: 249110 (0x7f426d0459db, 0x7f426d0826f1)
(249110 bytes)
received getdata for: block 000000000000015b15b4
02/24/20 15:37:47 sending: block Size large: 4295037378 (0x7f416d082709, 0x7f426d0938cb)
Size wedge: 4295037378 (0x7f416d082709, 0x7f426d0938cb)

The full logs for that run are available at this link here.

TRB did not print the nSize in this case, as now it is in the infinite sleep loop, but we can debug it still:

(gdb) up
#2  CNode::EndMessage (this=0xe509020) at net.h:322
(gdb) print nSize
$17 = 70082

What is the object that was pushed into vSend:

(gdb) up
#3  0x0000000000447d46 in CNode::PushMessage (a1=..., pszCommand=0x7aa8b6 "block", this=0xe509020) at net.h:386
(gdb) print a1
(gdb) up
#3  0x0000000000447d46 in CNode::PushMessage<CBlock> (a1=..., pszCommand=0x7aa8b6 "block", this=0xe509020) at net.h:386
(gdb) print a1
$18 = (const CBlock &) @0x7f43f5a17660: {nVersion = 2, hashPrevBlock = {<base_uint<256u>> = {pn = {1616193555, 2334131108, 1288270616, 3150207063, 1400775159, 1161119419, 423, 0}}, <No data fields>},
  hashMerkleRoot = {<base_uint<256u>> = {pn = {4193101932, 3179851861, 3044709830, 2477049586, 3119633084, 4107622684, 3385259423, 2071486490}}, <No data fields>}, nTime = 1366624628, nBits = 436330132,
  nNonce = 905500765, vtx = {<std::_Vector_base<CTransaction, std::allocator<CTransaction> >> = {
      _M_impl = {<std::allocator<CTransaction>> = {<__gnu_cxx::new_allocator<CTransaction>> = {<No data fields>}, <No data fields>}, _M_start = 0xca11ba0, _M_finish = 0xca139a0,
        _M_end_of_storage = 0xca139a0}}, <No data fields>}, vMerkleTree = {<std::_Vector_base<uint256, std::allocator<uint256> >> = {
      _M_impl = {<std::allocator<uint256>> = {<__gnu_cxx::new_allocator<uint256>> = {<No data fields>}, <No data fields>}, _M_start = 0x0, _M_finish = 0x0, _M_end_of_storage = 0x0}}, <No data fields>}, nDoS = 0}

I did not go further in the debugging yet, and I believe someone more knowledgeable in the TRB internals could point me to the relevant files and data structures. The interesting bit is that:

# from the transcript above:
(gdb) print/x 0x7f426d0938cb - 0x7f416d082709
$16 = 0x1000111c2
(gdb) print/x nSize
$19 = 0x111c2

I hope it is not a well-reproducible bitflip, always in the same location in virtual memory, which it also allocates to make the flipped value valid. Otherwise, looks close enough to an uint32 underflow result that somehow got subtracted from vSend.begin().

Anyhow, I have this thing in looping in the debugger, so if you have ideas about where to poke, write in.


The coredump and the binary are available here.

  1. I have no record of this gdb session, unfortunately, so cannot pinpoint the exact function where time is spent, but this is irrelevant for the write-up. Still, not a good feeling when precision is lost. []