Is Guix full-source bootstrap a lie?
No, it is not a lie! Guix blog post told me: « today, you get a package graph of more than 22,000 nodes rooted in a 357-byte program ». Mind-blowing! To my knowledge, the story is unique. For the interested reader, the adventure starts with Mes, Maxwell's Equations of Software – an attempt at dissolving bootstrap binaries, and then it is a journey, such excursion!
Can I verify myself the claims? How can I use Guix for checking by myself?
Consider a complex machine learning framework as PyTorch. It provides two high-level features as tensor computing (like powerful Python N-dimensional arrays as NumPy) and deep neural networks built on a tape-based automatic differentiation system, among many other features. Well, it is a real-world modern open source software. Something serious, isn't it?
From a security perspective, or even for cumulative science, the core question is thus: How can I trust the computations returned by PyTorch? This asks very broad topics. Let focus on just one: Can I trust the whole stack of software required by PyTorch? Other said, can I audit all the software and how they are built before they finally produce PyTorch that I run?
If you are not sure what means “required by PyTorch” here, please give a look at this FOSDEM 2023 talk or this short paper. Hope that's fine, continuing…
The number of dependencies (nodes) for the package python-pytorch
is more
than 1150 and Guix provides a mean for inspecting, verifying or auditing all
these dependencies; i.e., how they are built and linked altogether. Guix even
shows the shortest path (guix graph --path
) between the package
python-pytorch
and the root of the very beginning: binary Hex assembler.
python-pytorch@1.13.1 +--> python-matplotlib@3.5.2 +--> kexec-tools@2.0.23 onnx@1.12.0 | python-wxpython@4.2.0 | ld-wrapper@0 python-nbval@0.9.6 | wxwidgets@3.2.2.1 | guile@3.0.9 python-ipykernel@6.13.0 | webkitgtk-with-libsoup2@2.40.5 | gzip-mesboot@1.2.4 python-ipython@8.5.0 | elogind@252.9 | tcc-boot0@0.9.26-1136-g5bba73cc ... ---^ ... ---^ stage0-posix@1.4 bootstrap-seeds@1.0.0
Take a moment, it is something!
Guix makes transparent all the supply chain. One could see a supply chain as a graph of dependencies implementing how to process each node of this graph. That’s exactly the definition of Guix, no? Is Guix some software supply chain?
Now, how can I use Guix for checking by myself there is no cheat around? Ready for diving into plumbing solutions about some chicken-or-the-egg dilemma? Let’s go!
Chicken-or-the-egg problem
A chicken lays an egg, chicken hatches from egg, result: dilemma. How do we solve this infinite regress? In computing, it is solved by considering an initial chicken without asking where it does come from. What is the issue of this solution?
Consider the genealogy: a chicken lays an egg, that egg becomes chicken and then also lays a egg, again that egg becomes chicken and also lays a egg, etc. After some generations, we notice a genetic mutation for this genealogy. Darwinism aside, what does make the introduction of such genetic mutation? For detecting the anomaly, we need to carefully verify two things:
- that each egg is correctly laid without introducing a genetic mutation,
- that the very first chicken is safe with this genetic mutation.
If we do not control with enough fine-grain these both conditions, then the genealogy suffers from the « Trusting Trust attack ». Applied to computing, this genetic mutation is the metaphor of a backdoor or a fraud. Applied to compilers, it was described by Ken Thompson in « Reflections on Trusting Trust » in 1984. Today, controlling with very fine-grain the whole supply chain is the challenge for security or reproducibility, in scientific context too. My explanations are not enough visual, please give a look to this nice short video.
Considering the package python-pytorch
, the chicken-or-the-egg problem is a
recursive problem and it leads to carefully scrutinize:
- the whole chain with more than 1150 dependencies,
- and the initial
bootstrap-seeds
.
Because it is a recursive problem, we specifically need to check the initial
bootstrap-seeds
and all the source code of each dependencies. Then, because
Guix precisely determines how each successor entity depends on or is produced
by some of its predecessors, we are fine. Or we should.
Well, it seems straightforward or easy but it is far to be. Thanks to tireless effort by Janneke and Guix community, now the initial set (#2) that needs to be carefully checked is approachable. And by design of Guix, #1 means verify the source code of packages.
The impressive Bootstrapping story in Guix
The best resource for this story is the blog post « The Full-Source Bootstrap: Building from source all the way down ». Let me explain using my own words how I understand the story.
When PyTorch starts with almost nothing
Starting with a binary Hex assembler, each dependency is built and last we get
the machine learning framework PyTorch. That’s what does for us the command
guix build python-pytorch
: build all the 1150 dependencies; the successor
depending only on a restricted set of predecessors already built. And the
initial chicken is provided by the package bootstrap-seeds
.
What does contain this bootstrap-seeds
? Not much.
$ du -sh $(guix build -e '(@@ (gnu packages commencement) bootstrap-seeds)')
232K /gnu/store/dzciy04rqks9n6y8vrjyd5yrqpw1mjw4-bootstrap-seeds-1.0.0
It is just some files of less than 500 lines; mainly assembler code written by hand. Yeah, really! Written with a lot of love by real humans. Many thanks to Jeremiah Orians, Jan (janneke) Nieuwenhuizen and Sanne Wouda. Please note that nothing is built by Guix itself, what you fetch is what you run:
$ tar xvf $(guix build -e '(@@ (gnu packages commencement) bootstrap-seeds)' --source) $ guix hash -rx $(guix build -e '(@@ (gnu packages commencement) bootstrap-seeds)') bootstrap-seeds 0is4q1jaf6lswxbakcc22ia6z3hh55v1yw170kpwllrm7gl2byzx 0is4q1jaf6lswxbakcc22ia6z3hh55v1yw170kpwllrm7gl2byzx
That’s a piece of work! Concretely, it is possible to audit how, starting
from these binary seeds provided by bootstrap-seeds
, the next egg named
package stage0-posix
is built. It first builds hex0
and then all the way
up: hex1
, catm
, hex2
, M0
, cc_x86
, M1
, M2
, get_machine
– that's
all of MesCC-Tools – and finally M2-Planet. And then Guix builds all the
chain up to python-pytorch
. For the adventurous reader, give a look to the
plumbing details; for instance, the description of how Guix builds
stage0-posix
is given by:
$ guix build -e '(@@ (gnu packages commencement) stage0-posix)' --derivations
/gnu/store/hngai1p79jymvjfgvyjly5w5rrnpc1i4-stage0-posix-1.4.drv
And Guix spots out all the dependencies, no cheat! Here stage0-posix
also
needs gash-boot
and gash-utils-boot
: a POSIX-compatible shell and some Core
POSIX utilities, both all written in GNU Guile/Scheme. And again, I can audit
all.
Now two related questions pops up in mind. Is bootstrap-seeds
the root of
everything or do we need another program? How to compile these both
Guile/Scheme packages? The command guix graph
answers for us.
Ok, we also need the software bootar
and guile-bootstrap
for being able to get
bootstrap-seeds
and stage0-posix
and then start the chain.
What is bootar
?
bootar
is an implementation of Tar decompression and extraction in
Guile/Scheme, quoting its description:
Bootar is a simple Tar extractor written in Guile/Scheme. It supports running
tar xvf
on uncompressed tarballs or tarballs that are compressed with BZip2, GZip, or XZ. It also provides standalone scripts forbzip2
,gzip
, andxz
that each support decompression to standard output.What makes this special is that Bootar is distributed as a self-extracting Scheme (SES) program. That is, a little script that outputs the source code of Bootar. This makes it possible to go from pure Scheme to Tar and decompression in one easy step.
Oh, that's brilliant. However, it means we need Guile for running it. Yeah,
provided by guile-bootstrap
.
Origin of all: guile-bootstrap
, the driver
Well, the description of the package is a bit sparse.
$ guix show guile-bootstrap name: guile-bootstrap version: 2.0 outputs: + out: everything systems: x86_64-linux mips64el-linux aarch64-linux powerpc64le-linux + riscv64-linux i686-linux armhf-linux i586-gnu powerpc-linux dependencies: location: gnu/packages/bootstrap.scm:573:3 homepage: #f license: LGPL 3+ synopsis: Bootstrap Guile description: Pre-built Guile for bootstrapping purposes.
Please note the interesting point here, there is no dependencies. It suggests the package is standalone. What is the source? Hum, we need to dig in the code and the definition reads,
(define %bootstrap-guile ;; The Guile used to run the build scripts of the initial derivations. ;; It is just unpacked from a tarball containing a pre-built binary. ;; This is typically built using %GUILE-BOOTSTRAP-TARBALL below. ;; ;; XXX: Would need libc's `libnss_files2.so' for proper `getaddrinfo' ;; support (for /etc/services). (let ((raw (build-system (name 'raw) (description "Raw build system with direct store access") (lower make-raw-bag)))) (package (name "guile-bootstrap") (version "2.0") (source #f) (build-system raw) (synopsis "Bootstrap Guile") (description "Pre-built Guile for bootstrapping purposes.") (home-page #f) (license lgpl3+))))
No source, really? So, no dependencies and no source, really? Is
guile-bootstrap
the origin of the Universe creating something from the Void?
Guix is a damned good tool because it allows to inspect everything. How is
guile-bootstrap
built by Guix? Ask Guix and it answers: look at its
derivation,
$ guix build guile-bootstrap --derivations /gnu/store/5my8fmb3q4kpgmq0f16xffx6yai4rwbh-guile-bootstrap-2.0.drv
Sorry, we need to enter into the arcane of Guix. This guile-bootstrap
is not
created from nothing, it is the result of a build process (derivation) where
the inputs and the builder are clearly and explicitly defined. Nothing less,
nothing more.
Derive ([("out","/gnu/store/lgi9x15a0w35mcpd7g1kb9274r6wy4pv-guile-bootstrap-2.0","","")] ,[("/gnu/store/1k9p0fml8apa0c30wdyq5pifgifz5wks-tar.drv",["out"]) ,("/gnu/store/c3l0q44a4h2g6p7r61gkssj8cxq6ar8x-bash.drv",["out"]) ,("/gnu/store/nk17ldpmrh26z0fy3r7b4ipap2hqpm51-mkdir.drv",["out"]) ,("/gnu/store/rm2vdddlz1maz5rv069j1zp3ma3ydg43-guile-2.0.9.tar.xz.drv",["out"]) ,("/gnu/store/sd8g1dps3hr9kmy4c1v00pmigmk86fmw-xz.drv",["out"])] ,["/gnu/store/nmxl6qyj16bv4rx4irhg23r66gn752kd-build-bootstrap-guile.sh"] ,"x86_64-linux","/gnu/store/mzfkrxd4w8vqrmyrx169wj8wyw7r8i37-bash",["/gnu/store/nmxl6qyj16bv4rx4irhg23r66gn752kd-build-bootstrap-guile.sh"] ,[("GUILE_TARBALL","/gnu/store/3aigj659vsjxcrhn0r0vmkbjh3kj5pbs-guile-2.0.9.tar.xz") ,("out","/gnu/store/lgi9x15a0w35mcpd7g1kb9274r6wy4pv-guile-bootstrap-2.0")])
Let take a breath before diving deeper. Still following?
Analysing guile-bootstrap
derivation
This guile-bootstrap
derivation is composed by 3 parts:
- the script builder:
build-bootstrap-guile.sh
- helpers:
tar
,bash
,mkdir
andxz
- Guile binary:
guile-2.0.9.tar.xz
The script builder reads,
echo "unpacking bootstrap Guile to '$out'..." /gnu/store/jc2g1wcfwkxr7hindy29s744sgxn1w63-mkdir $out cd $out /gnu/store/qc9b01x31ayxb36r0zw5cw28awisdq98-xz -dc < $GUILE_TARBALL | /gnu/store/k015iy4mrrbd5vf2ihz40ai8swlvcj2p-tar xv # Use the bootstrap guile to create its own wrapper to set the load path. GUILE_SYSTEM_PATH=$out/share/guile/2.0 GUILE_SYSTEM_COMPILED_PATH=$out/lib/guile/2.0/ccache $out/bin/guile -c "(...)" $out /gnu/store/mzfkrxd4w8vqrmyrx169wj8wyw7r8i37-bash # Sanity check. $out/bin/guile --version
Nothing fancy. It just creates a directory, extract the compressed content, and “use the bootstrap Guile to create its own wrapper to set the load path”. No trap here.
All the helpers are indeed binaries. For instance, look:
$ file $(guix build /gnu/store/c3l0q44a4h2g6p7r61gkssj8cxq6ar8x-bash.drv) /gnu/store/mzfkrxd4w8vqrmyrx169wj8wyw7r8i37-bash: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, for GNU/Linux 2.6.30, stripped $ $(guix build /gnu/store/c3l0q44a4h2g6p7r61gkssj8cxq6ar8x-bash.drv) --version GNU bash, version 4.2.0(1)-release (i686-pc-linux-gnu) Copyright (C) 2011 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software; you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.
Concretely, how does it work? These 5 derivations for helpers and Guile describe where to fetch binaries and nothing more. These 5 derivations are fixed-output derivations. What does it mean?
Operations such as downloads for which the expected content hash is known in advance are modeled as fixed-output derivations. Unlike regular derivations, the outputs of a fixed-output derivation are independent of its inputs—e.g., a source code download produces the same result regardless of the download method and tools being used. Quoting section Derivations from the Guix manual.
Summarizing what we get
In all that story, which binary do we have to trust?
First, we must trust bootstrap-seeds
. That’s nice because it is small enough
for auditing the binary seeds. That’s already very excellent!
That’s said, although the package graph is rooted in bootstrap-seeds
, second
we also must trust the driver. It means: 1.3MiB for tar
, 1.3MiB for bash
,
0.7MiB for mkdir
and 0.844MiB for xz
. Last, 14MiB for uncompressed static
Guile binary. Result, it is a bit less that 20MiB.
In the blog post, they count 25MiB and I miss where the 5MiB come from. Even
considering bootar
which is about 564KiB. Hum, is 25 an overestimation or is
my 20 missing an usual suspect?
Opinionated next steps
The next question is how to rid of guile-bootstrap
? Here, it potentially
reads twofold related tasks:
- decrease
guile-bootstrap
and its binary helpers:tar
,bash
,mkdir
andxz
, - build
stage0-posix
with less dependencies.
If gash
and gash-utils
could run with another Scheme implementation, then it
would remove one dependency to guile-bootstrap
. It is not clear for me that
stage0-posix
internally implements features that could be built before the
need of gash
and gash-utils
. Else, I do not get why I could trust that other
Scheme binary implementation?
Moreover, I am also confused by the need of bootar
when the binary helper tar
is already dragged in the picture.
Considering this current approach, the Scheme implementation named “driver“
will be still around, for one need or the other. Now all is reduced to these
tiny bits, I do not see how it could be more elegant than implementing by hand
directly in binary what needs bootstrap-seeds
and stage0-posix
for
building.
Other said, it means manually implements a minimalist extractor and a minimalist shell directly in binary assembler. Is it affordable? I do not know… I only know that video folks have implemented dav1d – a AV1 decoder on all platforms mainly implemented in assembler. For the chicken-or-the-egg at hand, is the effort worth? And is such path better for transparency? Well, I would answer by another question: what does it mean “being verifiable”?
Any thought how to get rid of the driver? Or how to reduce again the binaries that need to be trust?
What is not explicitly told?
One of the biggest concern, in my humble opinion, about the current state of
this awesome story is non-deterministic compilations. And especially at early
stages, for example gash-boot
.
$ guix build -e '(@@ (gnu packages commencement) gash-boot)' $ guix build -e '(@@ (gnu packages commencement) gash-boot)' --check guix build: error: derivation `/gnu/store/…-gash-boot-0.3.0.drv' may not be deterministic: output `/gnu/store/…-gash-boot-0.3.0' differs
From the section above about chicken-or-the-egg problem, here the condition #1 « that each egg is correctly laid without introducing a genetic mutation » is not satisfied. What you get is not what I get. Therefore, it weakens the strategy, not to say potentially defeats it. The old – still open – bug #20272 tracks the infrequent progress; difficult problem though.
Moreover, all this Bootstrap story only applies for packages where the C
programming language is part of the bootstrapping mean. For instance, it is
applicable for some Python based on the interpreter CPython. It is also
applicable for the R language interpreter. Etc.
Which packages are not rooted in that bootstrap story so?
Only the packages that depends on programming languages that are not
bootstrapped from the C
programming language. Which ones? Concretely, at
least all the packages where the programming languages OCaml or Haskell are
involved.
These languages appears to you obscure, maybe, but do not think that you are
probably not relying on them. Examples of popular program: unison
for OCaml
and pandoc
for Haskell.
Concrete example. Seurat is an R package designed for QC, analysis, and
exploration of single-cell RNA-seq data. It is used daily by bioinformatic
folks; maybe run in production in some hospitals for helping in curing
diseases as cancer. Relationship with Haskell programming language? The
package Seurat depends on pandoc
, hence Haskell, via four intermediary other
packages. Trusting the computations of Seurat implies the trust of the
Haskell bootstrap compiler (somehow, similar as previous guile-bootstrap
).
Another example. The package python-pymc
is library for probabilistic
programming (formerly PyMC3) which helps for Bayesian statistical modeling
focusing on advanced Markov chain Monte Carlo (MCMC) or variational inference
algorithms. Nothing related with Haskell at first but this Python library
requires for one of its dependencies an Haskell compiler.
What is the size of binaries we must trust for building the package r-seurat
or python-pymc
? It is more than 450MiB. Ouch, nothing less. How to check
that? Let first determine which is the first Haskell compiler of the chain.
Tricky:
$ guix graph -t bag-emerged pandoc | grep label | grep -E 'ghc-[0-9]' | cut -f1 -d'[' "/gnu/store/8d2dsa57jww1ncyqf1fmj1vnjfx7ax1z-ghc-9.2.5.drv" "/gnu/store/1kd8x3kg22nwpmn97ihkzr15g1g1nk80-ghc-8.10.7.drv" "/gnu/store/b2lfx9psykr4lkj874kfrglysia2p3wp-ghc-8.6.5.drv" "/gnu/store/s63fpi10cd0d4h1pc4la0paj50r31g3d-ghc-8.4.4.drv" "/gnu/store/arybgkdkzrkwjcw1vnrq3k4s9nh6mxnd-ghc-8.0.2.drv" "/gnu/store/li4d0pri94qj17i0wiihvkfh7hchj7ix-ghc-7.10.3.drv"
Then let inspect how the package ghc@7.10.3
is built by giving a look to the
derivation. There, we see the dependency with
ghc-7.8.4-x86_64-unknown-linux-deb7.tar.xz
and after extraction, bang! More
than 450MiB of binaries inside. How many packages are impacted by that? Ask
Guix:
$ guix refresh -l ghc@7 | cut -f1 -d':'
Building the following 741 packages would ensure 2181 dependent packages are rebuilt
Almost 2,200 packages rely on opaque 450MiB binary and are very far from the nice Bootstrap story above. Same story for OCaml, although the number of packages is less then hundred; considering the 3 supported versions of OCaml, it is a bit less than 200 packages.
And there is maybe other packages around that require a small set of external binaries.
The number of available packages that Guix offers is growing daily. Today, I count 27,813 packages – without counting extra channels – and subtracting OCaml, Haskell, etc., then more than 22,000 packages seem bootstrapped by less than 25MiB. Yeah, mind-blowing!
Yet another chicken-or-the-egg problem?
The bootstrapping problem for Haskell is not solved. And Ricardo works hard
on it. Currently, from the older GHC around (4.08.2), which relies on
gcc-2.95
– part of the Bootstrapping story above – it is possible to chain
until version 6.10.4. Then versions 6.12.3 and 7.4.2 are not packaged yet for
completing the Haskell chain from version 4.08.2 to modern version as 9.2.5;
fully connecting the dots with bootstrap-seeds
and dropping these 450MiB of
binaries. The solution of this chicken-or-the-egg is not yet complete.
About OCaml, version 4.07 is connected to the Bootstrapping story. However, that’s not the case for the higher versions, as 4.14 or 5. The solution of this chicken-or-the-egg is done but not propagated then.
That’s a long post! Are we done now?
Wait, what do I need for running all these computations starting from
bootstrap-seeds
and friends? We assume that the binary guix-daemon
is
running, right? Euh?! How do you produce it? Another chicken-or-the-egg
problem, no?
Join the fun, join Guix!
For reproducing purpose, Guix revision is
6113e05
.