Café Guix: what about long term?

In this Café Guix, I talked about the long term: how can we redo later and elsewhere what we have done here and now? It was the occasion to speak about the graph of dependencies and about Software Heritage (SWH), which archives source code. The Guix project is working hard to get source code archived in SWH and then to be able to fetch it back from SWH. The PDF slides (in French) are available and their source is here.

Please have a look at these pointers: Guix and long-term archiving in action! or Which problems Scientist has that Guix solves. Here I would like to focus specifically on Software Heritage.

the take-home message

To allow long-term computational reproducibility, Guix must have at hand:

all the source code
backward-compatibility of the Linux kernel
some compatibility of the hardware (CPU, etc.)

And the core question is: what is the time window during which all three conditions hold? To my knowledge, the Guix project is unique in experimenting with this window for real, since v1.0 in 2019.

Guix makes it possible to transparently track the whole computational environment, so that anyone is able to reproduce and study it, bug for bug. In a way, Guix should manage everything,

guix time-machine -C channels.scm -- shell -m manifest.scm

provided that we specify:

how to build (channels.scm)
what to build (manifest.scm)

Therefore, the corollary becomes: in addition to these two files, what do we need to collectively preserve?

In short, the important information to preserve is "how to build" (identically). Guix builds packages and, under the hood, each package is defined by a recipe which looks like,

(package
  (name "python-pytorch")
  (version "1.10.2")
  (source ...)
  (build-system python-build-system)
  (arguments ...)
  (inputs (list eigen fp16 ...)))

and here two pieces of information must be preserved:

the source code (information provided by the field source),
the recipe itself, which allows this source code to be built (identically).

The keyword python-build-system hides many steps specific to Python packaging, which the field arguments may modify so that the package stays consistent with the rest of Guix. Moreover, this package python-pytorch depends on other packages, here eigen or fp16, and thus the recipes of these packages and their source code must be preserved too.
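
What "must be preserved too" amounts to is the transitive closure of the dependency graph. Here is a minimal sketch in Python, assuming a toy graph: only python-pytorch, eigen and fp16 come from the recipe above, whereas dep-x and dep-y are hypothetical placeholders, and this is not how Guix itself traverses packages.

```python
# Toy dependency graph: package name -> direct inputs.
# Only python-pytorch, eigen and fp16 come from the recipe above;
# "dep-x" and "dep-y" are hypothetical placeholders.
GRAPH = {
    "python-pytorch": ["eigen", "fp16"],
    "eigen": ["dep-x"],
    "fp16": ["dep-x", "dep-y"],
    "dep-x": [],
    "dep-y": [],
}


def closure(graph, root):
    """Return every package reachable from root, root included."""
    seen = set()
    stack = [root]
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # already visited through another path
        seen.add(name)
        stack.extend(graph[name])
    return seen


# Every name in this set needs both its recipe and its source code preserved.
to_preserve = closure(GRAPH, "python-pytorch")
print(sorted(to_preserve))
```

Losing the source code of any single node in this set is enough to break the rebuild of the root package.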

The recipes themselves live in a Git repository called a channel. Each software project, in turn, distributes its source code using one of various methods:

archive tarballs (compressed)
Git repository
Subversion repository
Mercurial repository
CVS repository

and Guix provides a method to fetch each of them (url-fetch, git-fetch, svn-fetch, hg-fetch and cvs-fetch). Considering the packages Guix offers by default, the distribution looks like,

$ guix repl -- sources.scm | sort | uniq -c | sort -nr
13432 url-fetch
6691  git-fetch
391   svn-fetch
43    other
31    hg-fetch
3     cvs-fetch

where sources.scm (see here) traverses all the packages and extracts their source type.
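
To read these counts as proportions, a few lines of Python are enough. This is only arithmetic on the figures listed above; nothing here queries Guix.

```python
# Counts copied from the listing above.
counts = {
    "url-fetch": 13432,
    "git-fetch": 6691,
    "svn-fetch": 391,
    "other": 43,
    "hg-fetch": 31,
    "cvs-fetch": 3,
}

total = sum(counts.values())
# Share of each fetch method, in percent of all package sources.
shares = {method: 100 * n / total for method, n in counts.items()}

print(f"total: {total}")
for method, share in shares.items():
    print(f"{method:10s} {share:5.1f}%")
```

Tarballs account for roughly two thirds of the sources and Git repositories for roughly one third, so these two methods are where archiving effort matters most.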

So far, so good! What is the problem?

Content on the Internet is ephemeral and link rot is a strong concern, one that undermines many scientific publications. What guarantees that the source code hosted on popular forges (say GitHub, GitLab or others) will still be there some time later? Besides the fact that the person in charge of publishing the source code can lose interest, which often leads to vanished pointers, the hosting service itself can vanish. Examples of once-popular forges that have vanished: Google Code (closed in 2016), Alioth (Debian's forge, replaced by Salsa in 2018), Gna! (closed in 2017 after 13 years of activity), Gitorious (closed in 2015, after being at some point the second most popular Git hosting service), etc.

Even though Guix folks are aware of the phenomenon and are working hard to build counter-measures, vanished source code still falls through the cracks. See issue #42162.

Believe it or not, gforge.inria.fr was finally phased out on Sept. 30th.
And believe it or not, despite all the work and all the chat :-), we lost
the source tarball of Scotch 6.1.1 for a short period of time (I found a copy
and uploaded it to berlin a couple of hours ago).

Therefore, we collectively need an archive to preserve all the source code over the long term.

Wait, this is precisely the mission of Software Heritage: to collect, preserve and share software in source code form. Please note that an archive is different from a forge. The mission of the former is the accumulation of historical records or materials for permanent or long-term preservation, whereas the latter aims to be a collaborative platform for both developing and sharing software.

Hm, we have just said that online services sometimes shut down… Why would it be different for Software Heritage? There is no guarantee; however, Software Heritage is heavily supported by international public institutions. Quoting:

Software Heritage is an open, non-profit initiative unveiled in 2016 by Inria. It is supported by a broad panel of institutional and industry partners, in collaboration with UNESCO.

The long term goal is to collect all publicly available software in source code form together with its development history, replicate it massively to ensure its preservation, and share it with everyone who needs it.

Software Heritage, the floor is yours! Godspeed and all the best for broad success.

Guix is able to fall back to Software Heritage when the original location has vanished, assuming the original location had been archived. Moreover, the Guix project is working hard to preserve Guix itself and all the packages. Well, the devil is in the details: although Guix paves the way, a bullet-proof out-of-the-box solution is not completely here yet; help is welcome.

For an example of this fallback mechanism in action, have a look at this post.
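
The lookup side of such a fallback can be sketched in a few lines of Python. This is an illustration under assumptions, not Guix's actual implementation: it relies on the public Software Heritage Web API, which exposes content objects by checksum, and the helper names content_url and is_archived are made up for this example. Unauthenticated API requests are rate-limited.

```python
import hashlib
import urllib.error
import urllib.request

SWH_API = "https://archive.softwareheritage.org/api/1"


def content_url(data: bytes) -> str:
    """Build the SWH API URL that looks up these bytes by their SHA-256."""
    digest = hashlib.sha256(data).hexdigest()
    return f"{SWH_API}/content/sha256:{digest}/"


def is_archived(data: bytes) -> bool:
    """Ask SWH whether these exact bytes are known (needs network access)."""
    try:
        with urllib.request.urlopen(content_url(data)) as response:
            return response.status == 200
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return False  # the archive does not know this content
        raise  # any other HTTP error is a real failure, not an answer


print(content_url(b"hello\n"))
```

Note that the query carries no URL of the original location at all: the checksum alone is enough to ask the archive for the content.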

opinionated closing words

When speaking about long-term preservation, more often than not I hear about archiving (ingesting the content into the archive) and hardly ever about looking this content up again. The core question is: how do we identify the piece of information?

Most projects refer to specific versions with a tag label, such as v1.2.3. For sure, this is really handy and practical for human communication. However, is it well suited for referencing at large?

The main issue is that a tag label is not self-referencing but depends on an external authority. For instance, the current version label of my project is v1.2 and I decide by myself (as the authority) to release all the new cool features under the version label v1.3, yet nothing prevents me from naming it v2.0 instead. For sure, policies such as semantic versioning are very helpful, but they do not guarantee that an independent observer would be able to check that both of us are speaking about the exact same source code.

Often, the counter-measure is to provide, alongside the version label, some integrity checksum. This integrity checksum uniquely identifies the source code (modulo hash collisions). And the checksum is self-referencing: independent observers are able to check by themselves that they are all referring to the exact same content.

It is about intrinsic identifiers (checksums) vs extrinsic identifiers (tag labels). Referencing by intrinsic identifier is much better for long-term preservation, for sure. Now, the question is: which algorithm should compute this intrinsic identifier? The well-known trap of new standards.
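
Both points can be made concrete with a minimal Python sketch. It computes two intrinsic identifiers for the same bytes: a plain SHA-256, and a Software Heritage identifier (SWHID) for a single file, which is a SHA-1 taken over a Git-style blob header followed by the content. The file content and the two version labels below are made up for the illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Plain SHA-256 of the bytes."""
    return hashlib.sha256(data).hexdigest()


def swhid_content(data: bytes) -> str:
    """SWHID of a single file: SHA-1 over a Git-style blob header plus the bytes."""
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


source = b"print('hello')\n"  # hypothetical file content

# Two people attach different labels to the very same bytes...
alice = {"label": "v1.3", "data": source}
bob = {"label": "v2.0", "data": source}

# ...the extrinsic labels disagree, the intrinsic identifiers do not.
print(alice["label"] == bob["label"])
print(swhid_content(alice["data"]) == swhid_content(bob["data"]))

# Same content, two algorithms, two unrelated identifiers.
print(sha256_hex(source))
print(swhid_content(source))
```

Anyone holding the bytes can recompute either identifier without asking an authority, but an identifier computed with one algorithm says nothing about the other, hence the trap.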

Join the fun, join the initiative Café Guix!