v.sh part 1: wrapper script

February 1st, 2020

Before touching new stuff, I would like to use the opportunity to fix some benign errors in vtools: gprbuild file of vdiff forced build of ksum.adb, and the unused Fs variable in ksum.adb.

curl 'http://bvt-trace.net/vpatches/vtools_small_fixes.vpatch' > vtools_small_fixes.vpatch
curl 'http://bvt-trace.net/vpatches/vtools_small_fixes.vpatch.bvt.sig' > vtools_small_fixes.vpatch.bvt.sig

Introduction

Over the years, the republic has seen several implementations of V-pressers, in different programming languages and with different performance1. It would seem that there is ample choice of pressers to use, so why would one write a new one? I took several things into account:

  • All current pressers use some sort of rather large scripting language. Typically, choice of such language is not long contemplated: some are chosen because they commonly available on all distributions, like Perl and Python, while CL or Scheme may get used because they are dear to the heart of the developer. However, starting the distribution development from scratch (think TMSR OS), it is not wise to import a scripting language without a serious consideration. OTOH, a V implemenatation must be provided there. Given that the presser is the only tool that uses a scripting language ATM, implementing it in a minimal-dependency language is something I find important.
  • Another reason, important for me personally, was poor performance of V on the larger codebases. There is no need to allocate 4Gb of memory and work for an hour to calculate a flow of a single vpatch.

To address the first point, a language present in a minimal-dependency system has to be picked. These are POSIX shell, awk, C, Ada. I did not seriously consider C because it is strictly inferior to Ada, with a single benefit of being available on Gales out of the box. Given that a lot of software is written in Ada already, this should not be a controversial choice. The weak and strong points of each language in context of v-presser are:

  1. The main wrapper calls out to shell utilities a lot, so POSIX shell is an obvious choice here. It is also notorious for its extremely poor array support.
  2. Ada, on the other hand, is well suited for implementing the underlying algorithms, and shines in the implementation of algorithms over arrays. It has huge problems with string processing though. Calling out to external utilities from Ada is much more verbose and painful than it needs to be. The performance of Ada code should be a few orders of magnitude better than in a scripting language.
  3. awk is good at filtering data, but using it to write a bigger application that shells out a lot would be awkward at best.

With this in mind, let's consider the structure of a new V implementation (v.sh):

  1. A main wrapper process written in shell, which drives the execution of all other tools.
  2. A set of applications in Ada, one for extracting information out of vpatches (vfilter), calculation of dependencies among vpatches (vflow), for topological sorting based on the previously calculated dependency information (vtoposort), and for simplication of the dependency graph (vsimplify).
  3. awk is used to bridge strong sides of shell to strong sides of Ada. There are a lot of awk scripts embedded into the main shell wrapper, but no freestanding ones.

The writeup will be split over 2 articles:

  1. I will introduce the wrapper shell script around all the tools.
  2. A set of Ada applications that will perform the dependency resolution across vpatches.

As the wrapper script depends on vpatch, all related vpatches are grown on vtools.

v.sh

#!/bin/sh

set -e

Make shell a bit more strict. Not much can be done here, shell remains shell.

vgpgdir=""
vpatches=""
flow=""
filtered=""
selfile=""
simplified=""
topo=""

These variables will contain paths to temporary files, all of which will be put in the directory $vgpgdir:

  1. $vpatches - mapping of vpatch paths to numeric ids.
  2. $flow - a file with dependencies and conflicts between vpatches.
  3. $filtered - a file with input to all Ada tools: mappings of hunks to vpatches. Fully numeric.
  4. $selfile - a file with numeric ids of vpatches selected for flow calculation.
  5. $simplified - a file with a cleaned up dependency tree (for building a dependency visualization).
  6. $topo - results of topological sort over the dependencies.
usage() {
	cat <<EOF
v patches                 -- list patches
v wot                     -- print WoT
v flow leaves             -- print patch flow
v graph dot-file svg-file -- graphviz plot of dependencies
v antecedents patch       -- list patch antecendents
v descendants patch       -- list patch descendants
v origin hash             -- find patch that introduced the hash
v press directory leaves  -- press all patches up to head
EOF
	exit 0
}

As you can see, the command selection is pretty standard. No automatic initialization/syncing with a remote code shelf though.

checkdir() {
	if [ ! -d "$1" ]; then
		echo "$1 directory not found"
		kill -TERM 0
	fi
}

Kill the application if a directory does not exist. Print the error message before killing.

loadwot() {
	vgpgdir="$(mktemp -d v-gpg-XXXXXX)"
	vpatches="$vgpgdir/vpatches.txt"
	flow="$vgpgdir/flow.txt"
	filtered="$vgpgdir/filtered.txt"
	selfile="$vgpgdir/selected.txt"
	simplified="$vgpgdir/simplified.txt"
	topo="$vgpgdir/topo.txt"

	checkdir wot
	for i in wot/*.asc; do
		gpg --homedir "$vgpgdir" --logger-fd 1 --batch
		--keyid-format=long --import $i >/dev/null 2>&1
	done
}

Creates the temporary GPG home directory, and initializes variables with paths to temporary files inside $vgpgdir. After that, tries to load the WoT from "wot" directory. I am expecting that people will ask to change "wot" to ".wot" as is the convention right now. I do not really like "hidden" files, though, so the presser I use on main machine uses "wot" and "seals".

cleanup() {
	if [ ! -z "$vgpgdir" ]; then
		rm -rf "$vgpgdir"
	fi
	exit 0
}
trap cleanup TERM INT

exit_msg() {
	echo $1
	cleanup
}

cleanup() deletes the temporary directory and exits. This functionality is invoked on TERM and INT signals. There is also exit_msg() function that prints the error message and exits with a cleanup.

verify() {
	vpatchpath="$1"
	vpatch="$(basename $vpatchpath)"
	signers=""

	checkdir seals
	for i in seals/$vpatch.*.sig; do
		signature="$(basename $i | sed 's/.sig$//g')"
		signer="${signature#$vpatch.}"
		if ! gpg --homedir "$vgpgdir" --logger-fd 1 --batch
			--keyid-format=long
			--verify "$i" "$vpatchpath" 2>&1 >/dev/null; then
			exit_msg "Bad signature $i for $vpatch"
		fi
		signers="$signers $signer"
	done

	[ -z "$signers" ] && exit_msg "No signatures for $vpatch"

	echo "$vpatch $signers" >> "$vgpgdir"/signers.txt
}

verify() checks that the provided vpatch matches the signatures. If the signature verification fails or if there are no signers of the vpatch, terminates the execution. It also records the signers of the vpatch in signers.txt in the temporary directory.

verifyall() {
	checkdir patches
	for i in patches/*.vpatch; do
		verify $i;
	done
}

Just verifies all vpatches in the patches directory.

filterall() {
	n=0

	for i in patches/*.vpatch; do
		vfilter "$n" < "$i"
		echo "$i $n" >> "$vpatches"
		n="$((n+1))"
	done
}

To calculate the dependencies across the vpatches, only hunks headers are necessary, not the actual contents of hunks. vfilter is the program that extracts the headers of the hunks, and prints them out in the following form:

vpatch-id in-file out-file in-hash out-hash

vpatch-id is passed as the first CLI argument to vfilter.

filterall() assigns each vpatch a numeric identifier, pipes the vpatch through vfilter program, and records vpatch-numerid id mapping in $vpatches file:

patches/vpatch-1.vpatch 0
patches/vpatch-2.vpatch 1
...

vfilter is currently implemented as an Ada application, though in practice it is equally feasible to be implemented in awk: the performance difference turned out to be 20 seconds for awk vs 15 seconds in Ada for filtering the linux-genesis.vpatch (~600Mb). The price that must be paid for shaving off these 5 seconds is cut-paste restructuring of vpatch.adb, and an extra 50-line Ada application. Is it worth an extra vpatch? I intend to provide both versions so far.

internall() {
	awk '
function dedupf(i) {if (!(i in a)) {a[i]=fcnt; fcnt++;} return a[i];}
function deduph(i) {if (!(i in b)) {b[i]=hcnt; hcnt++;} return b[i];}
BEGIN{fcnt=1;hcnt=0; deduph("false")}
{print $1, dedupf($2), dedupf($3), deduph($4), deduph($5)}'
}

However, the output of the filterall is still not good enough for Ada, which works best with numeric arrays. We perform some impedance matching in internall() by, well, interning the filenames and hashes using awk - replacing each of those with a numeric id. Nice thing is that all these mappings can be dropped immediately afterwards. We also make sure that the special hash value of "false" (file does not exist) gets the id of zero. The output of this function is:

vpatch-id in-file-id out-file-id in-hash-id out-hash-id

For example:

0 34 34 14 15

In practical terms, 600 Mb of linux-genesis.vpatch turns to 12 Mb after vfilter, and to 1.2 Mb after internall().

vpatchnos() {
	awk -v VF="$vpatches" '
FILENAME==VF {a[$1]=$2} FILENAME=="-" {print a[$1]}' "$vpatches" -
}

vpatchnames() {
	awk -v VF="$vpatches" '
FILENAME==VF {a[$2]=$1} FILENAME=="-" {print a[$1]}' "$vpatches" -
}

Utility functions for mapping vpatch names to their pre-recorded numeric ids and vice-versa.

vpatchflow() {
	[ -z "$*" ] && exit_msg "No leaves selected."
	filterall | internall >"$filtered"
	selected=""
	if [ "$*" == "-" ]; then
		selected=$(awk '{print $2}' < "$vpatches")
	else
		selected="$(echo $* | tr ' ' 'n' | vpatchnos)"
	fi
	vflow $(wc -l <"$filtered") $(wc -l <"$vpatches") <"$filtered" >"$flow"
	echo "$selected" | sed -e 's/^/s /g' - | sort | uniq >"$selfile"
	cat "$flow" "$selfile" | vtoposort > "$topo"
}

vpatchflow() calculates the dependencies among vpatches: first, it processes all vpatches through filterall-internall pipeline; then it determines which leaves are selected for pressing. Using "-" as a leaf selects all vpatches; otherwise, you can provide your selection as patches/A.vpatch, patches/B.vpatch - filename completion works. The filtered vpatches are fed into vflow application, which expects two CLI arguments: the count of hunks in all vpatches, and the count of vpatches; it reads the input from stdin. The output has the following format:

v vpatches-count
d descendant-id antecedent-id
...
c vpatch1-id vpatch2-id
...
  • First line is the same vpatches-count that was provided on the CLI, to save on the repetitive work, the next tool (vtoposort) will read it.
  • Lines starting with "d" list dependencies between vpatches.
  • Lines starting with "c" list mutually incompatible vpatches.

Then, selected vpatches form the second part of input to the vtoposort:

s selected-leaf-1-id
...

These two files are concatenated and passed to vtoposort; it's output is:

[C conflict1-id]
[C conflict2-id]
...
[L loop1-id]
[L loop2-id]
...
P press1-id
P press2-id
...

If there are any conflicting vpatches among the selected, they will be printed one per line, preceded by "C". Then, results of toposort will be printed: if the calculation was successful, the sorted order is printed preceded by "P", otherwise there is a loop and all vpatches in the loop are printed prefixed with "L".

vpatchflowerr() {
	ret=$(head -n1 "$topo" | awk '{print $1}')
	if [ "$ret" = "C" ]; then
		printf "nConflicting leaves selected:n"
		awk '$1 == "C" {print $2}' < "$topo" | vpatchnames | sort | uniq
		cleanup
	elif [ $ret == "L" ]; then
		echo "nLoop among vpatches:"
		awk '{print $2}' < "$topo" | vpatchnames
		cleanup
	elif [ "$ret" != "P" ]; then
		echo "Unknown error. Investigate in $vgpgdir:"
		cat "$topo"
		exit 0 # no cleanup
	fi
}

vpatchflowerr() detects errors in the results of toposort, and report them to the user.

[ -z "$1" ] && usage

At this point we can start looking at the command line arguments: print usage if v.sh is invoked without arguments.

loadwot

if [ "$1" == "wot" ]; then
    gpg --homedir "$vgpgdir" --logger-fd 1 --batch --keyid-format=long -k
    cleanup
fi

WoT is necessary for all other commands, so we can load it at this point. If this is all that was requested, print it and exit.

verifyall

Rest of the commands work with vpatches, verify them.

if [ "$1" == "patches" ]; then
	cat "$vgpgdir"/signers.txt
	cleanup
fi

Output of the "patches" command is as easy as output of the signers.txt (see verifyall()).

if [ "$1" == "graph" ]; then
	shift
	dotfile="$1"
	shift
	svgfile="$1"
	shift

	vpatchflow -
	cat "$flow" "$topo" | vsimplify > "$simplified"
	cat >"$dotfile" <<EOF
digraph {
node [shape=rectangle, color=black, style=bold]
EOF
	awk -v VF="$vpatches" '
FILENAME==VF {sub(".*/", "", $1);sub(".vpatch$","",$1);a[$2]=$1}
FILENAME=="-" && $1 == "d" {print """a[$3]"" -> ""a[$2]"""}
' "$vpatches" - <"$simplified" >>"$dotfile"
	echo "}" >> "$dotfile"
	dot -Tsvg $dotfile -o $svgfile
	# dot -Tcmapx $dotfile -o $svgfile.html-map
	cleanup
fi

graph command is a bit longer:

  1. First, consume the commmand line arguments.
  2. Calculate flow and toposort for all vpatches
  3. Feed flow and toposort to the vsimplify tool, that removes redudand connections from the graph. Visually:

    Figure 1. Before.



    Figure 2. After.

    vsimplify produces the following output:

    I descendant-id indirect-antecedent-id
    ...
    d descendant-id direct-antecedent-id
    

    The lines prefixed with "d" in the output of vsimplify (unlike in output of vflow) have a property that: ∀ v1 ∈ d(V) ∄ v2 ∈ d(V) : v1 ∈ I(v2)
    where:

    • d(V) is list of antecedents of vpatch V as per vsimplify output (d V A, d V B, ...)
    • I(V) is list of direct and indirect antecedents of vpatch V (calculated and printed by vsimplify prefixed with "I").
  4. This output is fed to an awk script that produces a graphviz file, graphviz creates an svg.
if [ "$1" == "antecedents" ]; then
	shift
	vpatch="$1"
	shift

	vpatchflow -
	vno=$(awk -v V="$vpatch" '$1==V {print $2}' < "$vpatches")
	cat "$flow" "$topo" | vsimplify > "$simplified"
	awk -v S="$simplified" -v T="$topo" -v N="$vno" '
FILENAME==S && ($1 == "I" || $1 == "d" ) && $2 == N {a[$3]=1}
FILENAME==T && ($1 == "P") {if ($2 in a) {print $2}}
' "$simplified" "$topo" | vpatchnames
	cleanup
fi

if [ "$1" == "descendants" ]; then
	shift
	vpatch="$1"
	shift

	vpatchflow -
	vno=$(awk -v V="$vpatch" '$1==V {print $2}' < "$vpatches")
	cat "$flow" "$topo" | vsimplify > "$simplified"
	awk -v S="$simplified" -v T="$topo" -v N="$vno" '
FILENAME==S && ($1 == "I" || $1 == "d" ) && $3 == N {a[$2]=1}
FILENAME==T && ($1 == "P") {if ($2 in a) {print $2}}
' "$simplified" "$topo" | vpatchnames
	cleanup
fi

These two commands use direct+indirect antecedents calculated by vsimplify to print antecedents or descendants in the press order.

if [ "$1" == "origin" ]; then
	shift
	hash="$1"
	shift

	filterall >"$filtered"
	awk -v hash="$hash" '$5 == hash {print $1}' <"$filtered" | vpatchnames
	cleanup
fi;

Origin does not need to intern anything, it selects the hash out of filtered but not interned vpatches.

if [ "$1" == "flow" ]; then
	shift

	vpatchflow $*

	echo "Press order:"
	awk '$1 == "P" {print $2}' < "$topo" | vpatchnames

	vpatchflowerr
	cleanup
fi

Calculate flow according to the selected vpatches, output it, but also report errors.

if [ "$1" == "press" ]; then
	shift
	dir="$1"
	shift

	[ -d "$dir" ] && exit_msg "Directory $dir already exists"
	mkdir -p "$dir"

	vpatchflow $*
	vpatchflowerr

	awk '{print $2}' < "$topo" | vpatchnames | while read i; do
		i=$(readlink -f $i)
		cd "$dir"
		vpatch <"$i" >/dev/null
		cd - >/dev/null
	done
	cleanup
fi

Press is also trivial at this point: read arguments, create press directory, calculate press order, press according to the order.

cleanup

Fin.

curl 'http://bvt-trace.net/vpatches/vtools_add_vsh.vpatch' > vtools_add_vsh.vpatch
curl 'http://bvt-trace.net/vpatches/vtools_add_vsh.vpatch.bvt.sig' > vtools_add_vsh.vpatch.bvt.sig
  1. Recently, Diana Coman has provided an extensive overview of the state of V. []

10 Responses to “v.sh part 1: wrapper script”

  1. Diana Coman says:
    1

    I am expecting that people will ask to change "wot" to ".wot"

    No, I really prefer wot (and seals) too! I don't really see any specific reason to have those as hidden directories anyway and I'm not aware of any rationale or discussion on this previously either.

    At a first read, it looks quite good to me. Looking forward to the rest of the series too.

  2. This is very neat. I have a yet-unfinished "vtron in posix utils" to go with my PGP sig eater in posix utils, but it is -- just as you predicted -- limited in the max weight of vpatch that it can eat.

    Seems to me that the simpler solution to "awk chokes on N GB vpatches" would be to fix awk though, rather than write a new util? Normally this would be impractical, but if you're baking a new linux kit, why not fix awk ?

  3. bvt says:
    3

    @Diana
    Thank you, good to hear that I'm not alone in this preference. The next post should come in ~3 days.

    @Stanislav
    Thanks. Re awk problem: I have never seriously considered awk for the core of the application -- I would have rather used shell+Ada if this was feasible; awk works great if the implemented application becomes short enough and data filtering is the task, but for numeric arrays Ada is much better. But in the context of vfilter and internall - awk anything but chokes, I was pleasantly surprised by it's performance.

  4. @bvt: Got it. (And, similarly, my PGP-eater also requires GNAT to be present on the machine (to have built Peh.))

  5. > A set of applications in Ada, one for extracting information out of vpatches (vfilter), calculation of dependencies among vpatches (vflow), for topological sorting based on the previously calculated dependency information (vtoposort), and for simplication of the dependency graph (vsimplify).

    For some reason I was expecting to see a hasher in there as well, vkeccak or w/e :p

    > No automatic initialization/syncing with a remote code shelf though.

    Really, why not ? I must specifically curl myself, why ? Can't it just have a

    v mirror -- copy the given path locally.

    such that v mirror "http://ossasepia.com/reference-code-shelf/" results in all the .vpatch and .sig files encountered dumped in a directory named for the domain ?

    > I do not really like "hidden" files, though

    I'm not much of a fan either. If memory serves it's some lulz introduced back when overoptimistic development attempted to make "a linux windows" (= ubuntu) on the flimsy (and meanwhile thoroughly disabused) hope that such would result into a linuxification of the windowstards, rather than a windowtardation of linux.

    But all that aside : no mention of manifest file ?

  6. bvt says:
    6

    For some reason I was expecting to see a hasher in there as well, vkeccak or w/e :p

    Myeah, I was thinking about merging all of the tools into one, perhaps later. Re vkeccak, vpatch checks input and the output hashes, so there is no need for extra verification. And even in this case, ksum would do.

    Really, why not ? I must specifically curl myself, why ? Can't it just have a

    v mirror -- copy the given path locally.

    such that v mirror "http://ossasepia.com/reference-code-shelf/" results in all the .vpatch and .sig files encountered dumped in a directory named for the domain ?

    I am not against this, but I have never personally seen the need for myself; I can add this later.

    I'm not much of a fan either. If memory serves it's some lulz introduced back when overoptimistic development attempted to make "a linux windows" (= ubuntu) on the flimsy (and meanwhile thoroughly disabused) hope that such would result into a linuxification of the windowstards, rather than a windowtardation of linux.

    IIRC original Unix team wanted to prevent ls from outputting "." (this directory) and ".." (parent directory), but made a bug - did not check if there are other symbols after the initial dots, and later this behavior got standardized. The windowstardation must be the extended attributes added for Samba/CIFS.

    But all that aside : no mention of manifest file ?

    The manifest file is there, but was nothing to write about, so I dropped it in the writeup.

  7. [...] his awk + posix shell + Ada Vtron in a 3 part series (1, 2, [...]

  8. [...] in scheduling errands first thing in the morning. [↩]I'm seeing something like bvt's v.sh as the way forward but the tried and mostly-true perl one as a present necessity. [...]

  9. [...] a thing would still require Perl, itself not an insignificant liability. While work had been underway to replace that, the results fell short of completeness, and from the ensuing discussion I decided [...]

Leave a Reply