Is Guix full-source bootstrap a lie?

No, it is not a lie! Guix blog post told me: « today, you get a package graph of more than 22,000 nodes rooted in a 357-byte program ». Mind-blowing! To my knowledge, the story is unique. For the interested reader, the adventure starts with Mes, Maxwell's Equations of Software – an attempt at dissolving bootstrap binaries, and then it is a journey, such excursion!

Can I verify myself the claims? How can I use Guix for checking by myself?

Consider a complex machine learning framework as PyTorch. It provides two high-level features as tensor computing (like powerful Python N-dimensional arrays as NumPy) and deep neural networks built on a tape-based automatic differentiation system, among many other features. Well, it is a real-world modern open source software. Something serious, isn't it?

From a security perspective, or even for cumulative science, the core question is thus: How can I trust the computations returned by PyTorch? This asks very broad topics. Let focus on just one: Can I trust the whole stack of software required by PyTorch? Other said, can I audit all the software and how they are built before they finally produce PyTorch that I run?

If you are not sure what means “required by PyTorch” here, please give a look at this FOSDEM 2023 talk or this short paper. Hope that's fine, continuing…

The number of dependencies (nodes) for the package python-pytorch is more than 1150 and Guix provides a mean for inspecting, verifying or auditing all these dependencies; i.e., how they are built and linked altogether. Guix even shows the shortest path (guix graph --path) between the package python-pytorch and the root of the very beginning: binary Hex assembler.

python-pytorch@1.13.1    +--> python-matplotlib@3.5.2        +--> kexec-tools@2.0.23
onnx@1.12.0              |    python-wxpython@4.2.0          |    ld-wrapper@0
python-nbval@0.9.6       |    wxwidgets@3.2.2.1              |    guile@3.0.9
python-ipykernel@6.13.0  |    webkitgtk-with-libsoup2@2.40.5 |    gzip-mesboot@1.2.4
python-ipython@8.5.0     |    elogind@252.9                  |    tcc-boot0@0.9.26-1136-g5bba73cc
    ...               ---^        ...                     ---^    stage0-posix@1.4
                                                                  bootstrap-seeds@1.0.0

Take a moment, it is something!

Guix makes transparent all the supply chain. One could see a supply chain as a graph of dependencies implementing how to process each node of this graph. That’s exactly the definition of Guix, no? Is Guix some software supply chain?

Now, how can I use Guix for checking by myself there is no cheat around? Ready for diving into plumbing solutions about some chicken-or-the-egg dilemma? Let’s go!

Chicken-or-the-egg problem

A chicken lays an egg, chicken hatches from egg, result: dilemma. How do we solve this infinite regress? In computing, it is solved by considering an initial chicken without asking where it does come from. What is the issue of this solution?

Consider the genealogy: a chicken lays an egg, that egg becomes chicken and then also lays a egg, again that egg becomes chicken and also lays a egg, etc. After some generations, we notice a genetic mutation for this genealogy. Darwinism aside, what does make the introduction of such genetic mutation? For detecting the anomaly, we need to carefully verify two things:

  1. that each egg is correctly laid without introducing a genetic mutation,
  2. that the very first chicken is safe with this genetic mutation.

If we do not control with enough fine-grain these both conditions, then the genealogy suffers from the « Trusting Trust attack ». Applied to computing, this genetic mutation is the metaphor of a backdoor or a fraud. Applied to compilers, it was described by Ken Thompson in « Reflections on Trusting Trust » in 1984. Today, controlling with very fine-grain the whole supply chain is the challenge for security or reproducibility, in scientific context too. My explanations are not enough visual, please give a look to this nice short video.

Considering the package python-pytorch, the chicken-or-the-egg problem is a recursive problem and it leads to carefully scrutinize:

  1. the whole chain with more than 1150 dependencies,
  2. and the initial bootstrap-seeds.

Because it is a recursive problem, we specifically need to check the initial bootstrap-seeds and all the source code of each dependencies. Then, because Guix precisely determines how each successor entity depends on or is produced by some of its predecessors, we are fine. Or we should.

Well, it seems straightforward or easy but it is far to be. Thanks to tireless effort by Janneke and Guix community, now the initial set (#2) that needs to be carefully checked is approachable. And by design of Guix, #1 means verify the source code of packages.

The impressive Bootstrapping story in Guix

The best resource for this story is the blog post « The Full-Source Bootstrap: Building from source all the way down ». Let me explain using my own words how I understand the story.

When PyTorch starts with almost nothing

Starting with a binary Hex assembler, each dependency is built and last we get the machine learning framework PyTorch. That’s what does for us the command guix build python-pytorch: build all the 1150 dependencies; the successor depending only on a restricted set of predecessors already built. And the initial chicken is provided by the package bootstrap-seeds.

What does contain this bootstrap-seeds? Not much.

$ du -sh $(guix build -e '(@@ (gnu packages commencement) bootstrap-seeds)')
232K    /gnu/store/dzciy04rqks9n6y8vrjyd5yrqpw1mjw4-bootstrap-seeds-1.0.0

It is just some files of less than 500 lines; mainly assembler code written by hand. Yeah, really! Written with a lot of love by real humans. Many thanks to Jeremiah Orians, Jan (janneke) Nieuwenhuizen and Sanne Wouda. Please note that nothing is built by Guix itself, what you fetch is what you run:

$ tar xvf $(guix build -e '(@@ (gnu packages commencement) bootstrap-seeds)' --source)
$ guix hash -rx  $(guix build -e '(@@ (gnu packages commencement) bootstrap-seeds)') bootstrap-seeds
0is4q1jaf6lswxbakcc22ia6z3hh55v1yw170kpwllrm7gl2byzx
0is4q1jaf6lswxbakcc22ia6z3hh55v1yw170kpwllrm7gl2byzx

That’s a piece of work! Concretely, it is possible to audit how, starting from these binary seeds provided by bootstrap-seeds, the next egg named package stage0-posix is built. It first builds hex0 and then all the way up: hex1, catm, hex2, M0, cc_x86, M1, M2, get_machine – that's all of MesCC-Tools – and finally M2-Planet. And then Guix builds all the chain up to python-pytorch. For the adventurous reader, give a look to the plumbing details; for instance, the description of how Guix builds stage0-posix is given by:

$ guix build -e '(@@ (gnu packages commencement) stage0-posix)'  --derivations
/gnu/store/hngai1p79jymvjfgvyjly5w5rrnpc1i4-stage0-posix-1.4.drv

And Guix spots out all the dependencies, no cheat! Here stage0-posix also needs gash-boot and gash-utils-boot: a POSIX-compatible shell and some Core POSIX utilities, both all written in GNU Guile/Scheme. And again, I can audit all.

Now two related questions pops up in mind. Is bootstrap-seeds the root of everything or do we need another program? How to compile these both Guile/Scheme packages? The command guix graph answers for us.

graph-bootstrap.jpg
Figure 1: dependencies of stage0-posix

Ok, we also need the software bootar and guile-bootstrap for being able to get bootstrap-seeds and stage0-posix and then start the chain.

What is bootar?

bootar is an implementation of Tar decompression and extraction in Guile/Scheme, quoting its description:

Bootar is a simple Tar extractor written in Guile/Scheme. It supports running tar xvf on uncompressed tarballs or tarballs that are compressed with BZip2, GZip, or XZ. It also provides standalone scripts for bzip2, gzip, and xz that each support decompression to standard output.

What makes this special is that Bootar is distributed as a self-extracting Scheme (SES) program. That is, a little script that outputs the source code of Bootar. This makes it possible to go from pure Scheme to Tar and decompression in one easy step.

Oh, that's brilliant. However, it means we need Guile for running it. Yeah, provided by guile-bootstrap.

Origin of all: guile-bootstrap, the driver

Well, the description of the package is a bit sparse.

$ guix show guile-bootstrap
name: guile-bootstrap
version: 2.0
outputs:
+ out: everything
systems: x86_64-linux mips64el-linux aarch64-linux powerpc64le-linux
+ riscv64-linux i686-linux armhf-linux i586-gnu powerpc-linux
dependencies:
location: gnu/packages/bootstrap.scm:573:3
homepage: #f
license: LGPL 3+
synopsis: Bootstrap Guile
description: Pre-built Guile for bootstrapping purposes.

Please note the interesting point here, there is no dependencies. It suggests the package is standalone. What is the source? Hum, we need to dig in the code and the definition reads,

(define %bootstrap-guile
  ;; The Guile used to run the build scripts of the initial derivations.
  ;; It is just unpacked from a tarball containing a pre-built binary.
  ;; This is typically built using %GUILE-BOOTSTRAP-TARBALL below.
  ;;
  ;; XXX: Would need libc's `libnss_files2.so' for proper `getaddrinfo'
  ;; support (for /etc/services).
  (let ((raw (build-system
               (name 'raw)
               (description "Raw build system with direct store access")
               (lower make-raw-bag))))
   (package
     (name "guile-bootstrap")
     (version "2.0")
     (source #f)
     (build-system raw)
     (synopsis "Bootstrap Guile")
     (description "Pre-built Guile for bootstrapping purposes.")
     (home-page #f)
     (license lgpl3+))))

No source, really? So, no dependencies and no source, really? Is guile-bootstrap the origin of the Universe creating something from the Void? Guix is a damned good tool because it allows to inspect everything. How is guile-bootstrap built by Guix? Ask Guix and it answers: look at its derivation,

$ guix build guile-bootstrap --derivations
/gnu/store/5my8fmb3q4kpgmq0f16xffx6yai4rwbh-guile-bootstrap-2.0.drv

Sorry, we need to enter into the arcane of Guix. This guile-bootstrap is not created from nothing, it is the result of a build process (derivation) where the inputs and the builder are clearly and explicitly defined. Nothing less, nothing more.

Derive
([("out","/gnu/store/lgi9x15a0w35mcpd7g1kb9274r6wy4pv-guile-bootstrap-2.0","","")]
 ,[("/gnu/store/1k9p0fml8apa0c30wdyq5pifgifz5wks-tar.drv",["out"])
   ,("/gnu/store/c3l0q44a4h2g6p7r61gkssj8cxq6ar8x-bash.drv",["out"])
   ,("/gnu/store/nk17ldpmrh26z0fy3r7b4ipap2hqpm51-mkdir.drv",["out"])
   ,("/gnu/store/rm2vdddlz1maz5rv069j1zp3ma3ydg43-guile-2.0.9.tar.xz.drv",["out"])
   ,("/gnu/store/sd8g1dps3hr9kmy4c1v00pmigmk86fmw-xz.drv",["out"])]
 ,["/gnu/store/nmxl6qyj16bv4rx4irhg23r66gn752kd-build-bootstrap-guile.sh"]
 ,"x86_64-linux","/gnu/store/mzfkrxd4w8vqrmyrx169wj8wyw7r8i37-bash",["/gnu/store/nmxl6qyj16bv4rx4irhg23r66gn752kd-build-bootstrap-guile.sh"]
 ,[("GUILE_TARBALL","/gnu/store/3aigj659vsjxcrhn0r0vmkbjh3kj5pbs-guile-2.0.9.tar.xz")
   ,("out","/gnu/store/lgi9x15a0w35mcpd7g1kb9274r6wy4pv-guile-bootstrap-2.0")])

Let take a breath before diving deeper. Still following?

Analysing guile-bootstrap derivation

This guile-bootstrap derivation is composed by 3 parts:

  • the script builder: build-bootstrap-guile.sh
  • helpers: tar, bash, mkdir and xz
  • Guile binary: guile-2.0.9.tar.xz

The script builder reads,

echo "unpacking bootstrap Guile to '$out'..."
/gnu/store/jc2g1wcfwkxr7hindy29s744sgxn1w63-mkdir $out
cd $out
/gnu/store/qc9b01x31ayxb36r0zw5cw28awisdq98-xz -dc < $GUILE_TARBALL | /gnu/store/k015iy4mrrbd5vf2ihz40ai8swlvcj2p-tar xv

# Use the bootstrap guile to create its own wrapper to set the load path.
GUILE_SYSTEM_PATH=$out/share/guile/2.0 GUILE_SYSTEM_COMPILED_PATH=$out/lib/guile/2.0/ccache $out/bin/guile -c "(...)" $out /gnu/store/mzfkrxd4w8vqrmyrx169wj8wyw7r8i37-bash

# Sanity check.
$out/bin/guile --version

Nothing fancy. It just creates a directory, extract the compressed content, and “use the bootstrap Guile to create its own wrapper to set the load path”. No trap here.

All the helpers are indeed binaries. For instance, look:

$ file $(guix build /gnu/store/c3l0q44a4h2g6p7r61gkssj8cxq6ar8x-bash.drv)
/gnu/store/mzfkrxd4w8vqrmyrx169wj8wyw7r8i37-bash: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, for GNU/Linux 2.6.30, stripped

$ $(guix build /gnu/store/c3l0q44a4h2g6p7r61gkssj8cxq6ar8x-bash.drv) --version
GNU bash, version 4.2.0(1)-release (i686-pc-linux-gnu)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Concretely, how does it work? These 5 derivations for helpers and Guile describe where to fetch binaries and nothing more. These 5 derivations are fixed-output derivations. What does it mean?

Operations such as downloads for which the expected content hash is known in advance are modeled as fixed-output derivations. Unlike regular derivations, the outputs of a fixed-output derivation are independent of its inputs—e.g., a source code download produces the same result regardless of the download method and tools being used. Quoting section Derivations from the Guix manual.

Summarizing what we get

In all that story, which binary do we have to trust?

First, we must trust bootstrap-seeds. That’s nice because it is small enough for auditing the binary seeds. That’s already very excellent!

That’s said, although the package graph is rooted in bootstrap-seeds, second we also must trust the driver. It means: 1.3MiB for tar, 1.3MiB for bash, 0.7MiB for mkdir and 0.844MiB for xz. Last, 14MiB for uncompressed static Guile binary. Result, it is a bit less that 20MiB.

In the blog post, they count 25MiB and I miss where the 5MiB come from. Even considering bootar which is about 564KiB. Hum, is 25 an overestimation or is my 20 missing an usual suspect?

Opinionated next steps

The next question is how to rid of guile-bootstrap? Here, it potentially reads twofold related tasks:

  • decrease guile-bootstrap and its binary helpers: tar, bash, mkdir and xz,
  • build stage0-posix with less dependencies.

If gash and gash-utils could run with another Scheme implementation, then it would remove one dependency to guile-bootstrap. It is not clear for me that stage0-posix internally implements features that could be built before the need of gash and gash-utils. Else, I do not get why I could trust that other Scheme binary implementation?

Moreover, I am also confused by the need of bootar when the binary helper tar is already dragged in the picture.

Considering this current approach, the Scheme implementation named “driver“ will be still around, for one need or the other. Now all is reduced to these tiny bits, I do not see how it could be more elegant than implementing by hand directly in binary what needs bootstrap-seeds and stage0-posix for building.

Other said, it means manually implements a minimalist extractor and a minimalist shell directly in binary assembler. Is it affordable? I do not know… I only know that video folks have implemented dav1d – a AV1 decoder on all platforms mainly implemented in assembler. For the chicken-or-the-egg at hand, is the effort worth? And is such path better for transparency? Well, I would answer by another question: what does it mean “being verifiable”?

Any thought how to get rid of the driver? Or how to reduce again the binaries that need to be trust?

What is not explicitly told?

One of the biggest concern, in my humble opinion, about the current state of this awesome story is non-deterministic compilations. And especially at early stages, for example gash-boot.

$ guix build -e '(@@ (gnu packages commencement) gash-boot)'
$ guix build -e '(@@ (gnu packages commencement) gash-boot)' --check
guix build: error: derivation `/gnu/store/…-gash-boot-0.3.0.drv' may not be deterministic: output `/gnu/store/…-gash-boot-0.3.0' differs

From the section above about chicken-or-the-egg problem, here the condition #1 «  that each egg is correctly laid without introducing a genetic mutation  » is not satisfied. What you get is not what I get. Therefore, it weakens the strategy, not to say potentially defeats it. The old – still open – bug #20272 tracks the infrequent progress; difficult problem though.

Moreover, all this Bootstrap story only applies for packages where the C programming language is part of the bootstrapping mean. For instance, it is applicable for some Python based on the interpreter CPython. It is also applicable for the R language interpreter. Etc.

Which packages are not rooted in that bootstrap story so?

Only the packages that depends on programming languages that are not bootstrapped from the C programming language. Which ones? Concretely, at least all the packages where the programming languages OCaml or Haskell are involved.

These languages appears to you obscure, maybe, but do not think that you are probably not relying on them. Examples of popular program: unison for OCaml and pandoc for Haskell.

Concrete example. Seurat is an R package designed for QC, analysis, and exploration of single-cell RNA-seq data. It is used daily by bioinformatic folks; maybe run in production in some hospitals for helping in curing diseases as cancer. Relationship with Haskell programming language? The package Seurat depends on pandoc, hence Haskell, via four intermediary other packages. Trusting the computations of Seurat implies the trust of the Haskell bootstrap compiler (somehow, similar as previous guile-bootstrap).

Another example. The package python-pymc is library for probabilistic programming (formerly PyMC3) which helps for Bayesian statistical modeling focusing on advanced Markov chain Monte Carlo (MCMC) or variational inference algorithms. Nothing related with Haskell at first but this Python library requires for one of its dependencies an Haskell compiler.

What is the size of binaries we must trust for building the package r-seurat or python-pymc? It is more than 450MiB. Ouch, nothing less. How to check that? Let first determine which is the first Haskell compiler of the chain. Tricky:

$ guix graph -t bag-emerged pandoc | grep label | grep -E 'ghc-[0-9]' | cut -f1 -d'['
  "/gnu/store/8d2dsa57jww1ncyqf1fmj1vnjfx7ax1z-ghc-9.2.5.drv"
  "/gnu/store/1kd8x3kg22nwpmn97ihkzr15g1g1nk80-ghc-8.10.7.drv"
  "/gnu/store/b2lfx9psykr4lkj874kfrglysia2p3wp-ghc-8.6.5.drv"
  "/gnu/store/s63fpi10cd0d4h1pc4la0paj50r31g3d-ghc-8.4.4.drv"
  "/gnu/store/arybgkdkzrkwjcw1vnrq3k4s9nh6mxnd-ghc-8.0.2.drv"
  "/gnu/store/li4d0pri94qj17i0wiihvkfh7hchj7ix-ghc-7.10.3.drv"

Then let inspect how the package ghc@7.10.3 is built by giving a look to the derivation. There, we see the dependency with ghc-7.8.4-x86_64-unknown-linux-deb7.tar.xz and after extraction, bang! More than 450MiB of binaries inside. How many packages are impacted by that? Ask Guix:

$ guix refresh -l ghc@7 | cut -f1 -d':'
Building the following 741 packages would ensure 2181 dependent packages are rebuilt

Almost 2,200 packages rely on opaque 450MiB binary and are very far from the nice Bootstrap story above. Same story for OCaml, although the number of packages is less then hundred; considering the 3 supported versions of OCaml, it is a bit less than 200 packages.

And there is maybe other packages around that require a small set of external binaries.

The number of available packages that Guix offers is growing daily. Today, I count 27,813 packages – without counting extra channels – and subtracting OCaml, Haskell, etc., then more than 22,000 packages seem bootstrapped by less than 25MiB. Yeah, mind-blowing!

Yet another chicken-or-the-egg problem?

The bootstrapping problem for Haskell is not solved. And Ricardo works hard on it. Currently, from the older GHC around (4.08.2), which relies on gcc-2.95 – part of the Bootstrapping story above – it is possible to chain until version 6.10.4. Then versions 6.12.3 and 7.4.2 are not packaged yet for completing the Haskell chain from version 4.08.2 to modern version as 9.2.5; fully connecting the dots with bootstrap-seeds and dropping these 450MiB of binaries. The solution of this chicken-or-the-egg is not yet complete.

About OCaml, version 4.07 is connected to the Bootstrapping story. However, that’s not the case for the higher versions, as 4.14 or 5. The solution of this chicken-or-the-egg is done but not propagated then.

That’s a long post! Are we done now?

Wait, what do I need for running all these computations starting from bootstrap-seeds and friends? We assume that the binary guix-daemon is running, right? Euh?! How do you produce it? Another chicken-or-the-egg problem, no?

Join the fun, join Guix!



For reproducing purpose, Guix revision is 6113e05.


© 2014-2024 Simon Tournier <simon (at) tournier.info >

(last update: 2024-04-12 Fri 20:39)