In the end of this weekend, I have discovered that my node is also in the list of wedged ones. However, the node restart did not help, and trb would reproducibly get stuck after a few hours. This warranted some debugging, so here is what I have go as a result meanwhile:
The observable deviation of node in wedge is that it spends *a lot* of time in SHA256 code1 :
#0 at some address in SHA256() #1 0x0000000000455aaa in Hash<__gnu_cxx::__normal_iterator<char*, std::vector<char, zero_after_free_allocator<char> > > > (pend=..., pbegin=...) at util.h:496 #2 CNode::EndMessage (this=0xe509020) at net.h:322 #3 0x0000000000447d46 in CNode::PushMessage(a1=..., pszCommand=0x7aa8b6 "block", this=0xe509020) at net.h:386 #4 ProcessMessage (pfrom=pfrom@entry=0xe509020, strCommand=..., vRecv=...) at main.cpp:1877 #5 0x000000000044a9e1 in ProcessMessages (pfrom=pfrom@entry=0xe509020) at main.cpp:2207 #6 0x000000000045d5c3 in ThreadMessageHandler2 (parg=parg@entry=0x0) at net.cpp:1129 #7 0x000000000045d7b7 in ThreadMessageHandler (parg=0x0) at net.cpp:1094
Poking it with gdb has shown that the amount of data to hash is around 4Gb:
(gdb) up #1 0x0000000000455aaa in Hash<__gnu_cxx::__normal_iterator> > > (pend=..., pbegin=...) at util.h:496 (gdb) print pbegin $13 = {_M_current = 0x7f416d082709 "k\230\242\241\246K٬\237믔*\220\320G\377\377\377\377\002\070\320\035"} (gdb) print pend $14 = {_M_current = 0x7f426d0938cb ""} (gdb) print 0x7f426d0938cb - 0x7f416d082709 $15 = 4295037378 (gdb) print/x 0x7f426d0938cb - 0x7f416d082709 $16 = 0x1000111c2
I was surprised that the application did not segfault, but no, all this memory is valid.
Meanwhile, do we have anything suspicious in the logs? I will not show the whole ~36000 continuous suspicious lines, it looked like a very long stream of:
received getdata for: block 0000000000000021f773 received getdata for: block 000000000000013cdf32 received getdata for: block 00000000000001614121 received getdata for: block 0000000000000152a8d3 received getdata for: block 000000000000010f9e85 received getdata for: block 00000000000000630879 received getdata for: block 000000000000007a52db received getdata for: block 00000000000000167398 received getdata for: block 0000000000000045e696 received getdata for: block 000000000000000e97b9 received getdata for: block 00000000000000dcb9b1 received getdata for: block 00000000000000d50c5a received getdata for: block 000000000000005346f8 received getdata for: block 00000000000001a18908 received getdata for: block 0000000000000115bca1 received getdata for: block 000000000000015bf999 received getdata for: block 00000000000001b4ba62 received getdata for: block 00000000000000095418 received getdata for: block 000000000000008e3f2f received getdata for: block 00000000000001a96ea8 received getdata for: block 00000000000000a5c79b received getdata for: block 00000000000000e03c41 received getdata for: block 0000000000000094b6a0 received getdata for: block 00000000000000fa78a1 received getdata for: block 00000000000001d2b140
If I go one level higher, the invocation of Hash happens in EndMessage:
void EndMessage() { if (mapArgs.count("-dropmessagestest") && GetRand(atoi(mapArgs["-dropmessagestest"])) == 0) { printf("dropmessages DROPPING SEND MESSAGE\n"); AbortMessage(); return; } if (nHeaderStart == -1) return; // Set the size unsigned int nSize = vSend.size() - nMessageStart; memcpy((char*)&vSend[nHeaderStart] + offsetof(CMessageHeader, nMessageSize), &nSize, sizeof(nSize)); // Set the checksum if (vSend.GetVersion() >= 209) { uint256 hash = Hash(vSend.begin() + nMessageStart, vSend.end()); unsigned int nChecksum = 0; memcpy(&nChecksum, &hash, sizeof(nChecksum)); assert(nMessageStart - nHeaderStart >= offsetof(CMessageHeader, nChecksum) + sizeof(nChecksum)); memcpy((char*)&vSend[nHeaderStart] + offsetof(CMessageHeader, nChecksum), &nChecksum, sizeof(nChecksum)); } if (fDebug) { printf("(%d bytes)\n", nSize); } nHeaderStart = -1; nMessageStart = -1; LEAVE_CRITICAL_SECTION(cs_vSend); }
One thing to test is if the span of (vSend.begin() + nMessageStart, vSend.end()) passed to Hash should be equal to nSize? Small changes to util.h:
inline uint256 Hash(const T1 pbegin, const T1 pend) { static unsigned char pblank[1]; uint256 hash1; if (pbegin > pend) { printf("%p > %p!\n", pbegin, pend); for(;;) { sleep(10); // spin } } if ((pend - pbegin) * sizeof(pbegin[0]) >= 0x10000) { printf("Size large: %zu (%p, %p)\n", (pend - pbegin) * sizeof(pbegin[0]), pbegin, pend); } if ((pend - pbegin) * sizeof(pbegin[0]) >= 0x100000000) { printf("Size wedge: %zu (%p, %p)\n", (pend - pbegin) * sizeof(pbegin[0]), pbegin, pend); for(;;) { sleep(10); // spin } } SHA256((pbegin == pend ? pblank : (unsigned char*)&pbegin[0]), (pend - pbegin) * sizeof(pbegin[0]), (unsigned char*)&hash1); uint256 hash2; SHA256((unsigned char*)&hash1, sizeof(hash1), (unsigned char*)&hash2); return hash2; }
Rebuild, launch with -debug, and indeed, in a few hours TRB gets stuck, and the span passed to Hash must be equal to nSize, from the logs:
received getdata for: block 0000000000000069f43c 02/24/20 15:37:47 sending: block Size large: 245265 (0x7f426cefdf51, 0x7f426cf39d62) (245265 bytes) received getdata for: block 00000000000001cfe272 02/24/20 15:37:47 sending: block (3985 bytes) received getdata for: block 00000000000001884a04 02/24/20 15:37:47 sending: block Size large: 106902 (0x7f426cf3ad23, 0x7f426cf54eb9) (106902 bytes) received getdata for: block 000000000000010bc490 02/24/20 15:37:47 sending: block (1366 bytes) received getdata for: block 000000000000002eac48 02/24/20 15:37:47 sending: block Size large: 191408 (0x7f426cf5543f, 0x7f426cf83fef) (191408 bytes) received getdata for: block 00000000000000e95188 02/24/20 15:37:47 sending: block Size large: 144180 (0x7f426cf84007, 0x7f426cfa733b) (144180 bytes) received getdata for: block 00000000000000abe2b0 02/24/20 15:37:47 sending: block Size large: 97463 (0x7f426cfa7353, 0x7f426cfbf00a) (97463 bytes) received getdata for: block 00000000000001cce3fd 02/24/20 15:37:47 sending: block (1201 bytes) received getdata for: block 000000000000002bbc8c 02/24/20 15:37:47 sending: block Size large: 120323 (0x7f426cfbf4eb, 0x7f426cfdcaee) (120323 bytes) received getdata for: block 00000000000001cb9e0f 02/24/20 15:37:47 sending: block (54066 bytes) received getdata for: block 00000000000000fcab37 02/24/20 15:37:47 sending: block Size large: 230268 (0x7f426cfe9e50, 0x7f426d0221cc) (230268 bytes) received getdata for: block 0000000000000018f8fe 02/24/20 15:37:47 sending: block (42035 bytes) received getdata for: block 00000000000001292f26 02/24/20 15:37:47 sending: block Size large: 103316 (0x7f426d02c62f, 0x7f426d0459c3) (103316 bytes) received getdata for: block 00000000000001a74535 02/24/20 15:37:47 sending: block Size large: 249110 (0x7f426d0459db, 0x7f426d0826f1) (249110 bytes) received getdata for: block 000000000000015b15b4 02/24/20 15:37:47 sending: block Size large: 4295037378 (0x7f416d082709, 0x7f426d0938cb) Size wedge: 4295037378 (0x7f416d082709, 0x7f426d0938cb)
The full logs for that run are available at this link here.
TRB did not print the nSize in this case, as now it is in the infinite sleep loop, but we can debug it still:
(gdb) up #2 CNode::EndMessage (this=0xe509020) at net.h:322 (gdb) print nSize $17 = 70082
What is the object that was pushed into vSend:
(gdb) up #3 0x0000000000447d46 in CNode::PushMessage(a1=..., pszCommand=0x7aa8b6 "block", this=0xe509020) at net.h:386 (gdb) print a1 (gdb) up #3 0x0000000000447d46 in CNode::PushMessage<CBlock> (a1=..., pszCommand=0x7aa8b6 "block", this=0xe509020) at net.h:386 (gdb) print a1 $18 = (const CBlock &) @0x7f43f5a17660: {nVersion = 2, hashPrevBlock = {<base_uint<256u>> = {pn = {1616193555, 2334131108, 1288270616, 3150207063, 1400775159, 1161119419, 423, 0}}, <No data fields>}, hashMerkleRoot = {<base_uint<256u>> = {pn = {4193101932, 3179851861, 3044709830, 2477049586, 3119633084, 4107622684, 3385259423, 2071486490}}, <No data fields>}, nTime = 1366624628, nBits = 436330132, nNonce = 905500765, vtx = {<std::_Vector_base<CTransaction, std::allocator<CTransaction> >> = { _M_impl = {<std::allocator<CTransaction>> = {<__gnu_cxx::new_allocator<CTransaction>> = {<No data fields>}, <No data fields>}, _M_start = 0xca11ba0, _M_finish = 0xca139a0, _M_end_of_storage = 0xca139a0}}, <No data fields>}, vMerkleTree = {<std::_Vector_base<uint256, std::allocator<uint256> >> = { _M_impl = {<std::allocator<uint256>> = {<__gnu_cxx::new_allocator<uint256>> = {<No data fields>}, <No data fields>}, _M_start = 0x0, _M_finish = 0x0, _M_end_of_storage = 0x0}}, <No data fields>}, nDoS = 0}
I did not go further in the debugging yet, and I believe someone more knowledgeable in the TRB internals could point me to the relevant files and data structures. The interesting bit is that:
# from the transcript above: (gdb) print/x 0x7f426d0938cb - 0x7f416d082709 $16 = 0x1000111c2 (gdb) print/x nSize $19 = 0x111c2
I hope it is not a well-reproducible bitflip, always in the same location in virtual memory, which it also allocates to make the flipped value valid. Otherwise, looks close enough to an uint32 underflow result that somehow got subtracted from vSend.begin().
Anyhow, I have this thing in looping in the debugger, so if you have ideas about where to poke, write in.
The coredump and the binary are available here.
- I have no record of this gdb session, unfortunately, so cannot pinpoint the exact function where time is spent, but this is irrelevant for the write-up. Still, not a good feeling when precision is lost. [↩]