Before touching new stuff, I would like to use the opportunity to fix some benign errors in vtools: gprbuild file of vdiff forced build of ksum.adb, and the unused Fs variable in ksum.adb.
curl 'http://bvt-trace.net/vpatches/vtools_small_fixes.vpatch' > vtools_small_fixes.vpatch curl 'http://bvt-trace.net/vpatches/vtools_small_fixes.vpatch.bvt.sig' > vtools_small_fixes.vpatch.bvt.sig
Introduction
Over the years, the republic has seen several implementations of V-pressers, in different programming languages and with different performance1. It would seem that there is ample choice of pressers to use, so why would one write a new one? I took several things into account:
- All current pressers use some sort of rather large scripting language. Typically, choice of such language is not long contemplated: some are chosen because they commonly available on all distributions, like Perl and Python, while CL or Scheme may get used because they are dear to the heart of the developer. However, starting the distribution development from scratch (think TMSR OS), it is not wise to import a scripting language without a serious consideration. OTOH, a V implemenatation must be provided there. Given that the presser is the only tool that uses a scripting language ATM, implementing it in a minimal-dependency language is something I find important.
- Another reason, important for me personally, was poor performance of V on the larger codebases. There is no need to allocate 4Gb of memory and work for an hour to calculate a flow of a single vpatch.
To address the first point, a language present in a minimal-dependency system has to be picked. These are POSIX shell, awk, C, Ada. I did not seriously consider C because it is strictly inferior to Ada, with a single benefit of being available on Gales out of the box. Given that a lot of software is written in Ada already, this should not be a controversial choice. The weak and strong points of each language in context of v-presser are:
- The main wrapper calls out to shell utilities a lot, so POSIX shell is an obvious choice here. It is also notorious for its extremely poor array support.
- Ada, on the other hand, is well suited for implementing the underlying algorithms, and shines in the implementation of algorithms over arrays. It has huge problems with string processing though. Calling out to external utilities from Ada is much more verbose and painful than it needs to be. The performance of Ada code should be a few orders of magnitude better than in a scripting language.
- awk is good at filtering data, but using it to write a bigger application that shells out a lot would be awkward at best.
With this in mind, let's consider the structure of a new V implementation (v.sh):
- A main wrapper process written in shell, which drives the execution of all other tools.
- A set of applications in Ada, one for extracting information out of vpatches (vfilter), calculation of dependencies among vpatches (vflow), for topological sorting based on the previously calculated dependency information (vtoposort), and for simplication of the dependency graph (vsimplify).
- awk is used to bridge strong sides of shell to strong sides of Ada. There are a lot of awk scripts embedded into the main shell wrapper, but no freestanding ones.
The writeup will be split over 2 articles:
- I will introduce the wrapper shell script around all the tools.
- A set of Ada applications that will perform the dependency resolution across vpatches.
As the wrapper script depends on vpatch, all related vpatches are grown on vtools.
v.sh
#!/bin/sh set -e
Make shell a bit more strict. Not much can be done here, shell remains shell.
vgpgdir="" vpatches="" flow="" filtered="" selfile="" simplified="" topo=""
These variables will contain paths to temporary files, all of which will be put in the directory $vgpgdir:
- $vpatches - mapping of vpatch paths to numeric ids.
- $flow - a file with dependencies and conflicts between vpatches.
- $filtered - a file with input to all Ada tools: mappings of hunks to vpatches. Fully numeric.
- $selfile - a file with numeric ids of vpatches selected for flow calculation.
- $simplified - a file with a cleaned up dependency tree (for building a dependency visualization).
- $topo - results of topological sort over the dependencies.
usage() { cat <<EOF v patches -- list patches v wot -- print WoT v flow leaves -- print patch flow v graph dot-file svg-file -- graphviz plot of dependencies v antecedents patch -- list patch antecendents v descendants patch -- list patch descendants v origin hash -- find patch that introduced the hash v press directory leaves -- press all patches up to head EOF exit 0 }
As you can see, the command selection is pretty standard. No automatic initialization/syncing with a remote code shelf though.
checkdir() { if [ ! -d "$1" ]; then echo "$1 directory not found" kill -TERM 0 fi }
Kill the application if a directory does not exist. Print the error message before killing.
loadwot() { vgpgdir="$(mktemp -d v-gpg-XXXXXX)" vpatches="$vgpgdir/vpatches.txt" flow="$vgpgdir/flow.txt" filtered="$vgpgdir/filtered.txt" selfile="$vgpgdir/selected.txt" simplified="$vgpgdir/simplified.txt" topo="$vgpgdir/topo.txt" checkdir wot for i in wot/*.asc; do gpg --homedir "$vgpgdir" --logger-fd 1 --batch --keyid-format=long --import $i >/dev/null 2>&1 done }
Creates the temporary GPG home directory, and initializes variables with paths to temporary files inside $vgpgdir. After that, tries to load the WoT from "wot" directory. I am expecting that people will ask to change "wot" to ".wot" as is the convention right now. I do not really like "hidden" files, though, so the presser I use on main machine uses "wot" and "seals".
cleanup() { if [ ! -z "$vgpgdir" ]; then rm -rf "$vgpgdir" fi exit 0 } trap cleanup TERM INT exit_msg() { echo $1 cleanup }
cleanup() deletes the temporary directory and exits. This functionality is invoked on TERM and INT signals. There is also exit_msg() function that prints the error message and exits with a cleanup.
verify() { vpatchpath="$1" vpatch="$(basename $vpatchpath)" signers="" checkdir seals for i in seals/$vpatch.*.sig; do signature="$(basename $i | sed 's/.sig$//g')" signer="${signature#$vpatch.}" if ! gpg --homedir "$vgpgdir" --logger-fd 1 --batch --keyid-format=long --verify "$i" "$vpatchpath" 2>&1 >/dev/null; then exit_msg "Bad signature $i for $vpatch" fi signers="$signers $signer" done [ -z "$signers" ] && exit_msg "No signatures for $vpatch" echo "$vpatch $signers" >> "$vgpgdir"/signers.txt }
verify() checks that the provided vpatch matches the signatures. If the signature verification fails or if there are no signers of the vpatch, terminates the execution. It also records the signers of the vpatch in signers.txt in the temporary directory.
verifyall() { checkdir patches for i in patches/*.vpatch; do verify $i; done }
Just verifies all vpatches in the patches directory.
filterall() { n=0 for i in patches/*.vpatch; do vfilter "$n" < "$i" echo "$i $n" >> "$vpatches" n="$((n+1))" done }
To calculate the dependencies across the vpatches, only hunks headers are necessary, not the actual contents of hunks. vfilter is the program that extracts the headers of the hunks, and prints them out in the following form:
vpatch-id in-file out-file in-hash out-hash
vpatch-id is passed as the first CLI argument to vfilter.
filterall() assigns each vpatch a numeric identifier, pipes the vpatch through vfilter program, and records vpatch-numerid id mapping in $vpatches file:
patches/vpatch-1.vpatch 0 patches/vpatch-2.vpatch 1 ...
vfilter is currently implemented as an Ada application, though in practice it is equally feasible to be implemented in awk: the performance difference turned out to be 20 seconds for awk vs 15 seconds in Ada for filtering the linux-genesis.vpatch (~600Mb). The price that must be paid for shaving off these 5 seconds is cut-paste restructuring of vpatch.adb, and an extra 50-line Ada application. Is it worth an extra vpatch? I intend to provide both versions so far.
internall() { awk ' function dedupf(i) {if (!(i in a)) {a[i]=fcnt; fcnt++;} return a[i];} function deduph(i) {if (!(i in b)) {b[i]=hcnt; hcnt++;} return b[i];} BEGIN{fcnt=1;hcnt=0; deduph("false")} {print $1, dedupf($2), dedupf($3), deduph($4), deduph($5)}' }
However, the output of the filterall is still not good enough for Ada, which works best with numeric arrays. We perform some impedance matching in internall() by, well, interning the filenames and hashes using awk - replacing each of those with a numeric id. Nice thing is that all these mappings can be dropped immediately afterwards. We also make sure that the special hash value of "false" (file does not exist) gets the id of zero. The output of this function is:
vpatch-id in-file-id out-file-id in-hash-id out-hash-id
For example:
0 34 34 14 15
In practical terms, 600 Mb of linux-genesis.vpatch turns to 12 Mb after vfilter, and to 1.2 Mb after internall().
vpatchnos() { awk -v VF="$vpatches" ' FILENAME==VF {a[$1]=$2} FILENAME=="-" {print a[$1]}' "$vpatches" - } vpatchnames() { awk -v VF="$vpatches" ' FILENAME==VF {a[$2]=$1} FILENAME=="-" {print a[$1]}' "$vpatches" - }
Utility functions for mapping vpatch names to their pre-recorded numeric ids and vice-versa.
vpatchflow() { [ -z "$*" ] && exit_msg "No leaves selected." filterall | internall >"$filtered" selected="" if [ "$*" == "-" ]; then selected=$(awk '{print $2}' < "$vpatches") else selected="$(echo $* | tr ' ' 'n' | vpatchnos)" fi vflow $(wc -l <"$filtered") $(wc -l <"$vpatches") <"$filtered" >"$flow" echo "$selected" | sed -e 's/^/s /g' - | sort | uniq >"$selfile" cat "$flow" "$selfile" | vtoposort > "$topo" }
vpatchflow() calculates the dependencies among vpatches: first, it processes all vpatches through filterall-internall pipeline; then it determines which leaves are selected for pressing. Using "-" as a leaf selects all vpatches; otherwise, you can provide your selection as patches/A.vpatch, patches/B.vpatch - filename completion works. The filtered vpatches are fed into vflow application, which expects two CLI arguments: the count of hunks in all vpatches, and the count of vpatches; it reads the input from stdin. The output has the following format:
v vpatches-count d descendant-id antecedent-id ... c vpatch1-id vpatch2-id ...
- First line is the same vpatches-count that was provided on the CLI, to save on the repetitive work, the next tool (vtoposort) will read it.
- Lines starting with "d" list dependencies between vpatches.
- Lines starting with "c" list mutually incompatible vpatches.
Then, selected vpatches form the second part of input to the vtoposort:
s selected-leaf-1-id ...
These two files are concatenated and passed to vtoposort; it's output is:
[C conflict1-id] [C conflict2-id] ... [L loop1-id] [L loop2-id] ... P press1-id P press2-id ...
If there are any conflicting vpatches among the selected, they will be printed one per line, preceded by "C". Then, results of toposort will be printed: if the calculation was successful, the sorted order is printed preceded by "P", otherwise there is a loop and all vpatches in the loop are printed prefixed with "L".
vpatchflowerr() { ret=$(head -n1 "$topo" | awk '{print $1}') if [ "$ret" = "C" ]; then printf "nConflicting leaves selected:n" awk '$1 == "C" {print $2}' < "$topo" | vpatchnames | sort | uniq cleanup elif [ $ret == "L" ]; then echo "nLoop among vpatches:" awk '{print $2}' < "$topo" | vpatchnames cleanup elif [ "$ret" != "P" ]; then echo "Unknown error. Investigate in $vgpgdir:" cat "$topo" exit 0 # no cleanup fi }
vpatchflowerr() detects errors in the results of toposort, and report them to the user.
[ -z "$1" ] && usage
At this point we can start looking at the command line arguments: print usage if v.sh is invoked without arguments.
loadwot if [ "$1" == "wot" ]; then gpg --homedir "$vgpgdir" --logger-fd 1 --batch --keyid-format=long -k cleanup fi
WoT is necessary for all other commands, so we can load it at this point. If this is all that was requested, print it and exit.
verifyall
Rest of the commands work with vpatches, verify them.
if [ "$1" == "patches" ]; then cat "$vgpgdir"/signers.txt cleanup fi
Output of the "patches" command is as easy as output of the signers.txt (see verifyall()).
if [ "$1" == "graph" ]; then shift dotfile="$1" shift svgfile="$1" shift vpatchflow - cat "$flow" "$topo" | vsimplify > "$simplified" cat >"$dotfile" <<EOF digraph { node [shape=rectangle, color=black, style=bold] EOF awk -v VF="$vpatches" ' FILENAME==VF {sub(".*/", "", $1);sub(".vpatch$","",$1);a[$2]=$1} FILENAME=="-" && $1 == "d" {print """a[$3]"" -> ""a[$2]"""} ' "$vpatches" - <"$simplified" >>"$dotfile" echo "}" >> "$dotfile" dot -Tsvg $dotfile -o $svgfile # dot -Tcmapx $dotfile -o $svgfile.html-map cleanup fi
graph command is a bit longer:
- First, consume the commmand line arguments.
- Calculate flow and toposort for all vpatches
- Feed flow and toposort to the vsimplify tool, that removes redudand connections from the graph. Visually:
Figure 1. Before.
Figure 2. After.
vsimplify produces the following output:I descendant-id indirect-antecedent-id ... d descendant-id direct-antecedent-id
The lines prefixed with "d" in the output of vsimplify (unlike in output of vflow) have a property that: ∀ v1 ∈ d(V) ∄ v2 ∈ d(V) : v1 ∈ I(v2)
where:- d(V) is list of antecedents of vpatch V as per vsimplify output (d V A, d V B, ...)
- I(V) is list of direct and indirect antecedents of vpatch V (calculated and printed by vsimplify prefixed with "I").
- This output is fed to an awk script that produces a graphviz file, graphviz creates an svg.
if [ "$1" == "antecedents" ]; then shift vpatch="$1" shift vpatchflow - vno=$(awk -v V="$vpatch" '$1==V {print $2}' < "$vpatches") cat "$flow" "$topo" | vsimplify > "$simplified" awk -v S="$simplified" -v T="$topo" -v N="$vno" ' FILENAME==S && ($1 == "I" || $1 == "d" ) && $2 == N {a[$3]=1} FILENAME==T && ($1 == "P") {if ($2 in a) {print $2}} ' "$simplified" "$topo" | vpatchnames cleanup fi if [ "$1" == "descendants" ]; then shift vpatch="$1" shift vpatchflow - vno=$(awk -v V="$vpatch" '$1==V {print $2}' < "$vpatches") cat "$flow" "$topo" | vsimplify > "$simplified" awk -v S="$simplified" -v T="$topo" -v N="$vno" ' FILENAME==S && ($1 == "I" || $1 == "d" ) && $3 == N {a[$2]=1} FILENAME==T && ($1 == "P") {if ($2 in a) {print $2}} ' "$simplified" "$topo" | vpatchnames cleanup fi
These two commands use direct+indirect antecedents calculated by vsimplify to print antecedents or descendants in the press order.
if [ "$1" == "origin" ]; then shift hash="$1" shift filterall >"$filtered" awk -v hash="$hash" '$5 == hash {print $1}' <"$filtered" | vpatchnames cleanup fi;
Origin does not need to intern anything, it selects the hash out of filtered but not interned vpatches.
if [ "$1" == "flow" ]; then shift vpatchflow $* echo "Press order:" awk '$1 == "P" {print $2}' < "$topo" | vpatchnames vpatchflowerr cleanup fi
Calculate flow according to the selected vpatches, output it, but also report errors.
if [ "$1" == "press" ]; then shift dir="$1" shift [ -d "$dir" ] && exit_msg "Directory $dir already exists" mkdir -p "$dir" vpatchflow $* vpatchflowerr awk '{print $2}' < "$topo" | vpatchnames | while read i; do i=$(readlink -f $i) cd "$dir" vpatch <"$i" >/dev/null cd - >/dev/null done cleanup fi
Press is also trivial at this point: read arguments, create press directory, calculate press order, press according to the order.
cleanup
Fin.
curl 'http://bvt-trace.net/vpatches/vtools_add_vsh.vpatch' > vtools_add_vsh.vpatch curl 'http://bvt-trace.net/vpatches/vtools_add_vsh.vpatch.bvt.sig' > vtools_add_vsh.vpatch.bvt.sig
- Recently, Diana Coman has provided an extensive overview of the state of V. [↩]