Benchmarking hashes for FG-fed Linux

November 14th, 2019

UPDATE: the vpatch published originally was missing the fix for the bug outlined here. It is doubly shameful because not only did I spend several days rediscovering it while working on the next vpatch, I also missed it in the writeup in the most embarrassing way. The reason for this was my relying on make clean and hand-picking edited files for vdiff, instead of using make mrproper and working from the vtree directly. I have amended the process accordingly.

keccak_squeeze should read:

void
keccak_squeeze(struct keccak_state *sctx, u8 *out)
{
       unsigned int i;
       u64 st[KECCAK_STATE_SIZE] = {0};
       keccakf(sctx->st);
       for (i = 0; i < KECCAK_STATE_SIZE; i++) {
               st[i] = cpu_to_le64(sctx->st[i]);
       }
       memcpy(&out[0], &st[0], sctx->rsiz);
}

The updated vpatch and seal are here:

curl 'http://bvt-trace.net/vpatches/linux-benchmark-v2.vpatch' > linux-benchmark-v2.vpatch
curl 'http://bvt-trace.net/vpatches/linux-benchmark-v2.vpatch.bvt.sig' > linux-benchmark-v2.vpatch.bvt.sig

The last article raised the question of which cryptographic transformation (cipher or hash) should be used for the good and fast hash implementations. I will provide some relevant information in this article.

The functions that I have used for measurements are:

  • SHA1 - was used in the original implementation;
  • ChaCha20 - was used in the original implementation;
  • Serpent - has survived at least some investigation;
  • Keccak - the standard hash so far.

Keccak Implementation

So, first I discovered that there is no Keccak implementation in the Linux kernel. OTOH, there is a SHA3 implementation that is the same modulo padding, which I have used as a foundation for the in-kernel Keccak. Let's start with include/linux/keccak.h:

#define KECCAK_DIGEST_SIZE      (512 / 8)
#define KECCAK_BLOCK_SIZE       (1344 / 8)
#define KECCAK_STATE_SIZE       25

struct keccak_state {
        u64             st[KECCAK_STATE_SIZE];
        unsigned int    rsiz;
        unsigned int    rsizw;

        unsigned int    partial;
        u8              buf[KECCAK_BLOCK_SIZE];
};

void keccak_init(struct keccak_state *sctx);
int keccak_update(struct keccak_state *sctx, const u8 *data, unsigned int len);
int keccak_final(struct keccak_state *sctx, u8 *out, size_t out_len);

The standard Keccak hash as used at least in vtools is KECCAK_DIGEST_SIZE bytes long (512 bits), and has rate 1344 bits, which translates into an internal buffer of KECCAK_BLOCK_SIZE (168) bytes. The state of Keccak-1600 consists of 25 64-bit words. The interface is dead simple: keccak_init, feed in data using keccak_update, pad&output using keccak_final.
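
A minimal usage sketch of this interface (data and data_len stand for some caller-supplied input; any output length may be requested):

struct keccak_state st;
u8 digest[KECCAK_DIGEST_SIZE];

keccak_init(&st);
keccak_update(&st, data, data_len);	/* may be called repeatedly */
keccak_final(&st, digest, sizeof(digest));	/* pad, squeeze, wipe the state */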

Now, linux/lib/keccak.c:

#include <linux/init.h>
#include <linux/module.h>
#include <linux/types.h>
#include <asm/unaligned.h>
#include <linux/keccak.h>

#define KECCAK_ROUNDS 24

#define ROTL64(x, y) (((x) << (y)) | ((x) >> (64 - (y))))

static const u64 keccakf_rndc[24] = {
	0x0000000000000001ULL, 0x0000000000008082ULL, 0x800000000000808aULL,
	0x8000000080008000ULL, 0x000000000000808bULL, 0x0000000080000001ULL,
	0x8000000080008081ULL, 0x8000000000008009ULL, 0x000000000000008aULL,
	0x0000000000000088ULL, 0x0000000080008009ULL, 0x000000008000000aULL,
	0x000000008000808bULL, 0x800000000000008bULL, 0x8000000000008089ULL,
	0x8000000000008003ULL, 0x8000000000008002ULL, 0x8000000000000080ULL,
	0x000000000000800aULL, 0x800000008000000aULL, 0x8000000080008081ULL,
	0x8000000000008080ULL, 0x0000000080000001ULL, 0x8000000080008008ULL
};

static const int keccakf_rotc[24] = {
	1,  3,	6,  10, 15, 21, 28, 36, 45, 55, 2,  14,
	27, 41, 56, 8,	25, 43, 62, 18, 39, 61, 20, 44
};

static const int keccakf_piln[24] = {
	10, 7,	11, 17, 18, 3, 5,  16, 8,  21, 24, 4,
	15, 23, 19, 13, 12, 2, 20, 14, 22, 9,  6,  1
};

/* update the state with given number of rounds */

static void keccakf(u64 st[25])
{
	int i, j, round;
	u64 t, bc[5];

	for (round = 0; round < KECCAK_ROUNDS; round++) {

		/* Theta */
		for (i = 0; i < 5; i++)
			bc[i] = st[i] ^ st[i + 5] ^ st[i + 10] ^ st[i + 15]
				^ st[i + 20];

		for (i = 0; i < 5; i++) {
			t = bc[(i + 4) % 5] ^ ROTL64(bc[(i + 1) % 5], 1);
			for (j = 0; j < 25; j += 5)
				st[j + i] ^= t;
		}

		/* Rho Pi */
		t = st[1];
		for (i = 0; i < 24; i++) {
			j = keccakf_piln[i];
			bc[0] = st[j];
			st[j] = ROTL64(t, keccakf_rotc[i]);
			t = bc[0];
		}

		/* Chi */
		for (j = 0; j < 25; j += 5) {
			for (i = 0; i < 5; i++)
				bc[i] = st[j + i];
			for (i = 0; i < 5; i++)
				st[j + i] ^= (~bc[(i + 1) % 5]) &
					     bc[(i + 2) % 5];
		}

		/* Iota */
		st[0] ^= keccakf_rndc[round];
	}
}

This code is taken verbatim from sha3_generic.c; a quick investigation showed that it is equivalent to what Diana's Keccak implementation does.

void keccak_init(struct keccak_state *sctx)
{
	memset(sctx, 0, sizeof(*sctx));
	sctx->rsiz = KECCAK_BLOCK_SIZE;
	sctx->rsizw = sctx->rsiz / 8;
}

Clears the structure and sets some constants. Properly speaking, rsiz and rsizw could be fixed at compile time, but I decided against doing so for now, to allow choosing other rates in case they are ever required.
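
Should other rates ever be needed, the init could take the rate as a parameter. A minimal sketch, assuming a hypothetical keccak_init_rate (not part of the vpatch); rates above 1344 bits would also require enlarging buf:

void keccak_init_rate(struct keccak_state *sctx, unsigned int rate_bits)
{
	/* rate_bits must be a multiple of 64 and below 1600 */
	memset(sctx, 0, sizeof(*sctx));
	sctx->rsiz = rate_bits / 8;	/* rate in bytes */
	sctx->rsizw = sctx->rsiz / 8;	/* rate in 64-bit words */
}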

int keccak_update(struct keccak_state *sctx, const u8 *data,
			unsigned int len)
{
	unsigned int done;
	const u8 *src;

	done = 0;
	src = data;

	if ((sctx->partial + len) > (sctx->rsiz - 1)) {
		if (sctx->partial) {
			done = -sctx->partial;
			memcpy(sctx->buf + sctx->partial, data,
			       done + sctx->rsiz);
			src = sctx->buf;
		}

		do {
			unsigned int i;

			for (i = 0; i < sctx->rsizw; i++)
				sctx->st[i] ^= get_unaligned_le64(src + 8 * i);
			keccakf(sctx->st);

			done += sctx->rsiz;
			src = data + done;
		} while (done + (sctx->rsiz - 1) < len);

		sctx->partial = 0;
	}
	memcpy(sctx->buf + sctx->partial, src, len - done);
	sctx->partial += (len - done);

	return 0;
}

Also taken verbatim from sha3_generic.c.

int keccak_final(struct keccak_state *sctx, u8 *out, size_t out_len)
{
	unsigned int i;
	unsigned int inlen = sctx->partial;
	size_t hashed = 0;

	sctx->buf[inlen++] = 0x01;
	memset(sctx->buf + inlen, 0, sctx->rsiz - inlen);
	sctx->buf[sctx->rsiz - 1] |= 0x80;

	for (i = 0; i < sctx->rsizw; i++) {
		sctx->st[i] ^= get_unaligned_le64(sctx->buf + 8 * i);
	}

	while (hashed < out_len) {
		u64 st[KECCAK_STATE_SIZE] = {0};
		size_t nbytes = min_t(size_t, sctx->rsiz, out_len - hashed);
		keccakf(sctx->st);
		for (i = 0; i < KECCAK_STATE_SIZE; i++) {
			st[i] = cpu_to_le64(sctx->st[i]);
		}
		memcpy(&out[hashed], &st[0], nbytes);
		hashed += nbytes;
	}
	memset(sctx, 0, sizeof(*sctx));
	return 0;
}

This is where my changes went in: the padding is changed from SHA3 to Keccak mode, and output of arbitrary-sized hashes is enabled by mixing and squeezing in a loop - for SHA3, a single iteration is always enough.
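
For reference, the whole difference between the two padding modes is the first padding byte written in keccak_final (the closing 0x80 bit is identical in both):

sctx->buf[inlen++] = 0x06;	/* SHA3: domain separation bits, as in sha3_generic.c */
sctx->buf[inlen++] = 0x01;	/* Keccak: plain pad10*1, as used here */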

Now, lib/Makefile:

@@ -18,7 +18,7 @@

 lib-y := ctype.o string.o vsprintf.o cmdline.o \
         rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-        idr.o int_sqrt.o extable.o \
+        idr.o int_sqrt.o extable.o keccak.o \
         sha1.o chacha20.o md5.o irq_regs.o argv_split.o \
         flex_proportions.o ratelimit.o show_mem.o \
         is_single_threaded.o plist.o decompress.o kobject_uevent.o \

Keccak is unconditionally enabled.

The vpatch and seal are available here:

curl 'http://bvt-trace.net/vpatches/linux-keccak.vpatch' > linux-keccak.vpatch
curl 'http://bvt-trace.net/vpatches/linux-keccak.vpatch.bvt.sig' > linux-keccak.vpatch.bvt.sig

Benchmarking Code

Now, let's move on to the vpatch with the actual RNG code. The code provided here can most likely be deduplicated and cleaned up; however, my intention for this vpatch was to provide a reference point for others to reproduce the measurements, not something that should go into an end-user system as-is.

typedef int (*Fn_Good_Hash)(const unsigned char __user *data, size_t data_len);
typedef int (*Fn_Fast_Hash)(const unsigned char __user *data, size_t data_len, size_t output_len);
typedef void (*Fn_Self_Hash)(size_t to_hash);

static int hash_good_sha1(const unsigned char __user *data, size_t data_len);
static int hash_fast_sha1(const unsigned char __user *data, size_t data_len, size_t output_len);
static void _outer_self_hash_sha1(size_t to_hash);

static int hash_good_chacha20(const unsigned char __user *data, size_t data_len);
static int hash_fast_chacha20(const unsigned char __user *data, size_t data_len, size_t output_len);
static void _outer_self_hash_chacha20(size_t to_hash);

static int hash_good_serpent(const unsigned char __user *data, size_t data_len);
static int hash_fast_serpent(const unsigned char __user *data, size_t data_len, size_t output_len);
static void _outer_self_hash_serpent(size_t to_hash);

static int hash_good_keccak(const unsigned char __user *data, size_t data_len);
static int hash_fast_keccak(const unsigned char __user *data, size_t data_len, size_t output_len);
static void _outer_self_hash_keccak(size_t to_hash);

Handy typedefs and forward declarations of functions with the necessary interfaces.

static unsigned current_hashing_impl = 0;
static struct hashing_impl {
	const char   *name;
	Fn_Good_Hash fn_good;
	Fn_Fast_Hash fn_fast;
	Fn_Self_Hash fn_self;
} hashing_impls[] = {
	{"default", hash_good_sha1, hash_fast_chacha20, _outer_self_hash_sha1},
	{"sha1", hash_good_sha1, hash_fast_sha1, _outer_self_hash_sha1},
	{"serpent", hash_good_serpent, hash_fast_serpent, _outer_self_hash_serpent},
	{"chacha20", hash_good_chacha20, hash_fast_chacha20, _outer_self_hash_chacha20},
	{"keccak", hash_good_keccak, hash_fast_keccak, _outer_self_hash_keccak},
};

static int
set_hashing_impl(int val)
{
	unsigned long flags;

	if (val < 0 || val >= ARRAY_SIZE(hashing_impls)) {
		printk("Fail to set hashing implementation %d\n", val);
		return -EINVAL;
	}

	spin_lock_irqsave(&outer_lock, flags);
	current_hashing_impl = val;
	spin_unlock_irqrestore(&outer_lock, flags);
	printk("Set RNG hash function set to %s\n", hashing_impls[val].name);
	return 0;
}

A table with different sets of hash functions, and a function to switch between them. The "default" variant corresponds to the function set that the first FG-RNG vpatch used.

static int
hash_good(const unsigned char __user *data, size_t data_len)
{
	return hashing_impls[current_hashing_impl].fn_good(data, data_len);
}

static int
hash_fast(const unsigned char __user *data, size_t data_len, size_t output_len)
{
	return hashing_impls[current_hashing_impl].fn_fast(data, data_len, output_len);
}

static void
_outer_self_hash(size_t to_hash)
{
	hashing_impls[current_hashing_impl].fn_self(to_hash);
}

The good-, fast-, and self-hash functions redirect to the corresponding function in the currently selected set.

Good hashes

static int
hash_good_sha1(const unsigned char __user *data, size_t data_len)
{
	union sha1state hash;
	__u32 workspace[SHA_WORKSPACE_WORDS];
	u8 s_block[SHA_MESSAGE_BYTES];
	size_t nbytes;
	size_t hashed = 0;

	sha_init(hash.w);
	while (hashed < data_len) {
		nbytes = data_len - hashed;
		nbytes = min_t(size_t, sizeof(s_block), nbytes);
		memset(s_block, 0, sizeof(s_block));
		if (copy_from_user(s_block, data + hashed, nbytes)) {
			return -EFAULT;
		}
		sha_transform(hash.w, s_block, workspace);
		kfifo_in(&outer_ring,  &hash.c[0], sizeof(hash));
		hashed += nbytes;
	}
	return 0;
}

SHA1 as good hash. This function is unchanged.

static struct serpent_ctx zero_key_serpent;
static int rand_initialize(void)
{
	/* s_key can be customizable in future */
	u8 s_key[SERPENT_MAX_KEY_SIZE] = {0};
	__serpent_setkey(&zero_key_serpent, &s_key[0], sizeof(s_key));
	return 0;
}
early_initcall(rand_initialize);

static int
hash_good_serpent(const unsigned char __user *data, size_t data_len)
{
	struct serpent_ctx sctx;
	u8 s_block[SERPENT_BLOCK_SIZE];
	size_t nbytes;
	size_t hashed = 0;

	sctx = zero_key_serpent;
	while (hashed < data_len) {
		nbytes = data_len - hashed;
		nbytes = min_t(size_t, sizeof(s_block), nbytes);
		memset(s_block, 0, sizeof(s_block));
		if (copy_from_user(s_block, data + hashed, nbytes)) {
			return -EFAULT;
		}
		__serpent_encrypt(&sctx, s_block, s_block);
		kfifo_in(&outer_ring,  &s_block[0], sizeof(s_block));
		hashed += nbytes;
	}
	return 0;
}

Serpent as good hash. A block of data is encrypted to produce a hashed byte block. The key expansion is moved to an initialization function - which in the next vpatch will also allow setting a per-user key for the selected algorithm, not necessarily Serpent.

static int
hash_good_chacha20(const unsigned char __user *data, size_t data_len)
{
	u32 ctx[16] = {0};
	u8 s_block[CHACHA20_BLOCK_SIZE];
	size_t nbytes;
	size_t hashed = 0;

	while (hashed < data_len) {
		nbytes = data_len - hashed;
		nbytes = min_t(size_t, sizeof(s_block), nbytes);
		memset(s_block, 0, sizeof(s_block));
		if (copy_from_user(s_block, data + hashed, nbytes)) {
			return -EFAULT;
		}
		chacha20_block(&ctx[0], s_block);
		kfifo_in(&outer_ring,  &s_block[0], nbytes);
		hashed += nbytes;
	}
	return 0;
}

ChaCha20 as good hash. Almost identical to the Serpent implementation.

static int
hash_good_keccak(const unsigned char __user *data, size_t data_len)
{
	struct keccak_state kst;
	u8 s_block[KECCAK_BLOCK_SIZE];
	size_t nbytes;
	size_t hashed = 0;
	size_t squeezed = 0;

	keccak_init(&kst);
	while (hashed < data_len) {
		nbytes = data_len - hashed;
		nbytes = min_t(size_t, sizeof(s_block), nbytes);
		if (copy_from_user(s_block, data + hashed, nbytes)) {
			return -EFAULT;
		}
		keccak_update(&kst, &s_block[0], nbytes);
		hashed += nbytes;
	}

	keccak_pad(&kst);
	while (squeezed < data_len) {
		nbytes = data_len - squeezed;
		nbytes = min_t(size_t, sizeof(s_block), nbytes);
		keccak_squeeze(&kst, s_block);
		kfifo_in(&outer_ring, s_block, nbytes);
		squeezed += nbytes;
	}
	return 0;
}

Keccak as good hash: first feed all available data into the Keccak state, then squeeze the state repeatedly to produce the necessary number of bytes - the output length equals the input length.

Fast hashes

static int
hash_fast_sha1(const unsigned char __user *data, size_t data_len, size_t output_len)
{
	union sha1state hash;
	__u32 workspace[SHA_WORKSPACE_WORDS];
	u8 s_block[SHA_MESSAGE_BYTES] = {0};
	size_t nbytes;
	size_t hashed = 0;

	data_len = min_t(size_t, data_len, sizeof(hash));
	if (copy_from_user(&hash.c[0], data, data_len)) {
		return -EFAULT;
	}

	while (hashed < output_len) {
		nbytes = output_len - hashed;
		nbytes = min_t(size_t, sizeof(hash), nbytes);
		sha_transform(hash.w, s_block, workspace);
		nbytes = kfifo_in(&outer_ring, &hash.c[0], nbytes);
		hashed += nbytes;
	}
	return 0;
}

SHA1 fast hash: initialize the hash state from inner ring overflow data, feed in an empty block, mix the hash into the outer ring.

static int
hash_fast_serpent(const unsigned char __user *data, size_t data_len, size_t output_len)
{
	struct serpent_ctx sctx;
	u8 s_key[SERPENT_MAX_KEY_SIZE];
	u8 s_block[SERPENT_BLOCK_SIZE] = {0};
	size_t nbytes;
	size_t hashed = 0;

	data_len = min_t(size_t, data_len, sizeof(s_key));
	if (copy_from_user(&s_key[0], data, data_len)) {
		return -EFAULT;
	}

	__serpent_setkey(&sctx, &s_key[0], sizeof(s_key));
	while (hashed < output_len) {
		nbytes = output_len - hashed;
		nbytes = min_t(size_t, sizeof(s_block), nbytes);
		__serpent_encrypt(&sctx, s_block, s_block);
		nbytes = kfifo_in(&outer_ring,	&s_block[0], nbytes);
		hashed += nbytes;
	}
	return 0;
}

Serpent fast hash: uses inner ring data as key, repeatedly encrypts a block with the key, feeds encrypted data into the outer ring.

static int
hash_fast_chacha20(const unsigned char __user *data, size_t data_len, size_t output_len)
{
	u32 ctx[16] = {0};
	u8 s_block[CHACHA20_BLOCK_SIZE] = {0};
	size_t nbytes;
	size_t hashed = 0;

	data_len = min_t(size_t, data_len, sizeof(ctx));
	if (copy_from_user(&ctx[0], data, data_len)) {
		return -EFAULT;
	}

	while (hashed < output_len) {
		nbytes = output_len - hashed;
		nbytes = min_t(size_t, CHACHA20_BLOCK_SIZE, nbytes);
		chacha20_block(&ctx[0], s_block);
		nbytes = kfifo_in(&outer_ring, s_block, nbytes);
		hashed += nbytes;
	}
	return 0;
}

The ChaCha20 implementation of the fast hash is the same as Serpent's.

static int
hash_fast_keccak(const unsigned char __user *data, size_t data_len, size_t output_len)
{
	struct keccak_state kst;
	u8 s_block[KECCAK_BLOCK_SIZE] = {0};
	size_t nbytes;
	size_t hashed = 0;

	data_len = min_t(size_t, data_len, sizeof(s_block));
	if (copy_from_user(&s_block[0], data, data_len)) {
		return -EFAULT;
	}

	keccak_init(&kst);
	keccak_update(&kst, s_block, data_len);
	keccak_pad(&kst);
	while (hashed < output_len) {
		nbytes = output_len - hashed;
		nbytes = min_t(size_t, sizeof(s_block), nbytes);
		keccak_squeeze(&kst, s_block);
		nbytes = kfifo_in(&outer_ring, s_block, nbytes);
		hashed += nbytes;
	}
	return 0;
}

Keccak for fast hashing: absorb RNG data into the Keccak state, then repeatedly squeeze the state until the necessary number of bytes is produced.

Outer ring self-hashing functions

static void
_outer_self_hash_sha1(size_t to_hash)
{
	union sha1state hash;
	__u32 workspace[SHA_WORKSPACE_WORDS];
	u8 s_block[SHA_MESSAGE_BYTES];
	size_t nbytes, peeked;
	size_t hashed = 0;

	sha_init(hash.w);
	while (hashed < to_hash) {
		memset(s_block, 0, sizeof(s_block));
		nbytes = min_t(size_t, sizeof(hash), (to_hash - hashed));
		peeked = kfifo_out_peek_forward(&outer_ring, s_block, nbytes);
		sha_transform(hash.w, s_block, workspace);
		nbytes = kfifo_in(&outer_ring, &hash.c[0], nbytes);
		hashed += nbytes;
	}
}

Peek existing bytes from the outer ring, use them to update the hash state, push the hash state into the outer ring.

static void
_outer_self_hash_serpent(size_t to_hash)
{
	struct serpent_ctx sctx;
	u8 s_block[SERPENT_BLOCK_SIZE];
	size_t nbytes, peeked;
	size_t hashed = 0;

	sctx = zero_key_serpent;
	while (hashed < to_hash) {
		memset(s_block, 0, sizeof(s_block));
		nbytes = min_t(size_t, sizeof(s_block), (to_hash - hashed));
		peeked = kfifo_out_peek_forward(&outer_ring, s_block, nbytes);
		__serpent_encrypt(&sctx, s_block, s_block);
		nbytes = kfifo_in(&outer_ring, &s_block[0], nbytes);
		hashed += nbytes;
	}
}

Same as the SHA1 code, but encrypts the bytes with a zero key instead of hashing.

static void
_outer_self_hash_chacha20(size_t to_hash)
{
	u32 ctx[16] = {0};
	u8 s_block[CHACHA20_BLOCK_SIZE];
	size_t nbytes, peeked;
	size_t hashed = 0;

	while (hashed < to_hash) {
		memset(s_block, 0, sizeof(s_block));
		nbytes = min_t(size_t, sizeof(s_block), (to_hash - hashed));
		peeked = kfifo_out_peek_forward(&outer_ring, s_block, nbytes);
		chacha20_block(&ctx[0], s_block);
		nbytes = kfifo_in(&outer_ring, &s_block[0], nbytes);
		hashed += nbytes;
	}
}

Same as Serpent, but uses ChaCha20 instead.

static void
_outer_self_hash_keccak(size_t to_hash)
{
	struct keccak_state kst;
	u8 s_block[KECCAK_BLOCK_SIZE];
	size_t nbytes, peeked;
	size_t hashed = 0;

	keccak_init(&kst);
	while (hashed < to_hash) {
		nbytes = min_t(size_t, sizeof(s_block), (to_hash - hashed));
		peeked = kfifo_out_peek_forward(&outer_ring, s_block, nbytes);
		keccak_update(&kst, &s_block[0], nbytes);
		keccak_pad(&kst);
		keccak_squeeze(&kst, s_block);
		nbytes = kfifo_in(&outer_ring, s_block, nbytes);
		hashed += nbytes;
	}
}

Keccak self-hashing is a bit trickier - the data has to be padded before squeezing, so the implementation 1. peeks the data, 2. absorbs it and pads the state, 3. squeezes out a block of data. This process is repeated until the necessary number of bytes has been pushed into the outer ring.

@@ -334,7 +703,7 @@

 static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
 {
-       int size, ent_count;
+       int size, ent_count, val;
        int __user *p = (int __user *)arg;
        int retval;

@@ -346,7 +715,11 @@
                        return -EFAULT;
                return 0;
        case RNDADDTOENTCNT:
-               return 0;
+               /* This is temporary hijacked to set the function index */
+               if (get_user(val, p)) {
+                       return -EFAULT;
+               }
+               return set_hashing_impl(val);
        case RNDADDENTROPY:
                if (!capable(CAP_SYS_ADMIN))
                        return -EPERM;

I have hijacked the RNDADDTOENTCNT ioctl for switching the hashing function set. As I mentioned, this is just for measurement.

void
keccak_pad(struct keccak_state *sctx)
{
       unsigned int i;
       unsigned int inlen = sctx->partial;

       sctx->buf[inlen++] = 0x01;
       memset(sctx->buf + inlen, 0, sctx->rsiz - inlen);
       sctx->buf[sctx->rsiz - 1] |= 0x80;

       for (i = 0; i < sctx->rsizw; i++) {
               sctx->st[i] ^= get_unaligned_le64(sctx->buf + 8 * i);
       }

}

void
keccak_squeeze(struct keccak_state *sctx, u8 *out)
{
       unsigned int i;
       u64 st[KECCAK_STATE_SIZE] = {0};
       keccakf(sctx->st);
       for (i = 0; i < KECCAK_STATE_SIZE; i++) {
               st[i] = cpu_to_le64(sctx->st[i]);
       }
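       /* Bug: this copies over the pointer variable itself instead of the
          buffer it points to; the corrected version is in the UPDATE above. */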
       memcpy(&out, &st[0], sctx->rsiz);
}

To implement Keccak good-, fast-, and self-hashing, I needed an interface more flexible than keccak_final, so I have split its functionality into separate keccak_pad and keccak_squeeze.
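
With the split in place, keccak_final itself reduces to a pad followed by a squeeze loop. A sketch of how it could be re-expressed on top of the two new primitives, equivalent to the keccak_final shown earlier:

int keccak_final(struct keccak_state *sctx, u8 *out, size_t out_len)
{
	u8 block[KECCAK_BLOCK_SIZE];
	size_t hashed = 0;

	keccak_pad(sctx);
	while (hashed < out_len) {
		size_t nbytes = min_t(size_t, sctx->rsiz, out_len - hashed);

		keccak_squeeze(sctx, block);	/* with the &out[0] fix from the UPDATE */
		memcpy(out + hashed, block, nbytes);
		hashed += nbytes;
	}
	memset(sctx, 0, sizeof(*sctx));	/* keccak_final also wipes the state */
	return 0;
}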

Update to userspace apps

#include <stdlib.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <linux/random.h>
#include <errno.h>
#include <string.h>

int main(int argc, char *argv[])
{
       int impl;
       int fd;
       int r;

       if (argc != 2) {
               fprintf(stderr, "usage: %s impl_no\n", argv[0]);
               exit(EXIT_FAILURE);
       }
       fd = open("/dev/random", O_RDWR);
       if (fd == -1) {
               fprintf(stderr, "Error opening file: %s\n", strerror(errno));
               exit(EXIT_FAILURE);
       }
       impl = atoi(argv[1]);
       r = ioctl(fd, RNDADDTOENTCNT, &impl);
       if (r == -1) {
               fprintf(stderr, "Error setting hash implementation: %s\n", strerror(errno));
               exit(EXIT_FAILURE);
       }
       return 0;
}

A program to change the hash function sets from userspace, in linux/tools/rng/change-hash-fns.c.
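
For example, to select the Keccak set (index 4 in hashing_impls above), assuming the tool is compiled under its source name:

# ./change-hash-fns 4

The "default" set is index 0; an out-of-range index makes the ioctl fail with EINVAL.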

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <sys/types.h>
#include <termios.h>
#include <unistd.h>
#include <errno.h>

int init_fg(const char *devpath)
{
    int r;
    int fd;
    struct termios term;

    fd = open(devpath, O_RDONLY);
    if (fd == -1) {
        fprintf(stderr, "Error opening file: %s\n", strerror(errno));
        exit(EXIT_FAILURE);
    }
    r = tcgetattr(fd, &term);
    if (r == -1) {
        fprintf(stderr, "Error getting tty attributes: %s\n", strerror(errno));
        exit(EXIT_FAILURE);
    }
    cfmakeraw(&term);
    term.c_cc[VMIN] = 255;
    r = cfsetspeed(&term, B115200);
    if (r == -1) {
        fprintf(stderr, "Error setting serial speed: %s\n", strerror(errno));
        exit(EXIT_FAILURE);
    }
    r = tcsetattr(fd, TCSANOW, &term);
    if (r == -1) {
        fprintf(stderr, "Error setting tty attributes: %s\n", strerror(errno));
        exit(EXIT_FAILURE);
    }
    return fd;
}

int main(int argc, char *argv[])
{
    int in, out;
    if (argc != 2) {
        fprintf(stderr, "usage: %s FUCKGOATS-tty\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    in = init_fg(argv[1]);
    out = open("/dev/random", O_WRONLY);
    while (1) {
        char buf[4096];
        ssize_t nread = 0;
        while (nread != sizeof(buf)) {
            usleep(500000);
            ssize_t l = read(in, &buf[nread], sizeof(buf) - nread);
            if (l == -1) {
                fprintf(stderr, "Read failed: %s\n", strerror(errno));
                exit(EXIT_FAILURE);
            }
            nread += l;
        }
        write(out, &buf[0], nread);
    }
    return 0;
}

The updated /dev/random feeder. At the time of the initial report, I had figured out how VMIN works and reported success - prematurely. To further reduce CPU utilization, the remaining changes (the usleep call and the loop that accumulates a full 4096-byte buffer) were added. FG has a rate of about 7.2 kB/s, which means that around 4k of data arrives every half a second (the constant in usleep), provided the kernel buffers this much data - and it does [1]. As a result, the buffer is typically filled in one read operation. The same could not be achieved with VMIN alone, because it is an 8-bit variable, and 255 is the maximum amount of data it can arrange to be read in one call.
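
To run it, assuming the binary is called feeder and the FG sits on a USB-serial adapter (both the name and the device path are illustrative):

# ./feeder /dev/ttyUSB0 &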

Measurement results

First, the performance impact of the new way of reading from the TTY. I used

dd if=/dev/random of=/dev/null bs=128 count=100000

as the workload, and measured the CPU consumption. The results are 1.3% for the scheme with only the term.c_cc[VMIN] = 255 line added, and 0% (below the noise margin) with the new scheme. I did verify with strace that the whole byte block is read.

The results stayed the same (0%) when the inner ring was full and FG data was being processed with the good hash function, for all hash function implementations. So there is no drawback from using Keccak as the good hash function.

Next, let's measure the performance of self-hashing (no data is fed from the FG in this experiment). Here, the reader is CPU-limited - utilization will always be 100%, and the important metric is throughput. The workload is:

# dd if=/dev/urandom of=/dev/null bs=4096 count=10000;

repeated several times and averaged.

The results are:

Implementation    Throughput, MB/s
SHA1              117
Serpent           76.5
ChaCha20          331
Keccak            23.4

So Keccak has comparatively low throughput in self-hashing mode, which may be a limiting factor for some applications. Perhaps this is where a tunable knob (possibly per-application) would be useful - to allow selecting a faster hash if necessary. For my needs, though, 24 MB/s is more than enough.

I did not find a good way to evaluate the fast hash function: in normal operation, fast hashing is used only rarely, and manifests as a small CPU utilization spike whose exact magnitude is hard to capture precisely. If there is an active reader from urandom that somehow requires GBs of data, the impact of self-hashing may be higher, though. Indirectly, the impact of fast hashing can be estimated from the performance of self-hashing: Keccak in self-hashing mode executes the Keccak-1600 mixer twice per KECCAK_BLOCK_SIZE bytes - once in pad, once in squeeze. In fast hashing, it is executed once per KECCAK_BLOCK_SIZE bytes, so the throughput should be ~2*23.4 MB/s. Fast hashing would typically have to fill 1 MB of outer ring per write, and there will be ~two writes per second, so 2 MB/s. 2 MB/s / (2*23.4 MB/s) = 0.04 [Erlang?] = 4% for Keccak.

The last measurement: the rate of filling a buffer inside the kernel using the good hash. The workload is

# cat /proc/sys/kernel/random/outer_ring_avail; sleep 60; cat /proc/sys/kernel/random/outer_ring_avail

repeated several times.

The results are:

Implementation    Fill rate, kB/s
SHA1              2.15
Serpent           7.01
ChaCha20          7.01
Keccak            7.00

So, with all hash functions except SHA1, there is no fill rate reduction compared to the 'no hash' case - the fill rate essentially matches the ~7.2 kB/s FG input rate.

The story of the hang-at-boot bug may also be interesting. The problem was that I mistakenly copied a comparatively large amount of data to the location of the pointer variable itself, not to the location the pointer points to - i.e. memcpy(&ptr, from, ALOT) instead of memcpy(ptr, from, ALOT). With C, this stuff can happen; however, I would have expected the crash to be reproducible on the VM that I used for testing - it was not. It turned out that on the toiletbox VM host the kernel was compiled with gcc5, not with gcc4 as used on the bare-metal test machine. GCC5 identified that the code was crap, but did not complain, and just optimized the memcpy away, preventing the crash...

Anyhow, the vpatch for benchmarking code is here:

curl 'http://bvt-trace.net/vpatches/linux-benchmark.vpatch' > linux-benchmark.vpatch
curl 'http://bvt-trace.net/vpatches/linux-benchmark.vpatch.bvt.sig' > linux-benchmark.vpatch.bvt.sig
  1. I did not investigate what the upper limit of in-kernel buffering is.