Back from FOSDEM: Guix + Software Heritage = ♥
Wow, I am still recovering from these intensive days! It was such an enjoyable experience. So refreshing to team up with people from very diverse backgrounds and to work together, mixing ideas, feedback, and experience. It energizes you for the rest of the year.
Yeah, a packed week: a working group meeting of the French network for Reproducible Research about reproducible computational environments on Monday, the talk “Guix: a factory for container?” as part of Café Guix on Tuesday, the Software Heritage Symposium at UNESCO on Wednesday, the Software Heritage enthusiasts session on Thursday, Guix Days on Friday, a talk about Guix and Software Heritage at FOSDEM, and… oops, I skipped the Declarative and Minimalistic Computing devroom on Sunday and instead enjoyed the nice Brussels sun and visited a friend. Yeah, the good kind of packed week: the one that energizes you for the rest of the year!
Special thanks to the Software Heritage team for all the organization, to Tanguy and Pjotr for making the Guix Days possible, and to the Open Research devroom people for this very devroom. It’s great to have such an opportunity booster!
In the talk « Guix + Software Heritage: Source Code Archiving to the Rescue of Reproducible Deployment », the Too Long; Didn’t Watch message is:
- Cite and reference your source code with a Software Heritage identifier;
- Build your computational environment with Guix.
Reproducible research and your future self will thank you for both choices. Bet?
In this post, I would like to highlight a couple of points that were presented in this talk.
For more details, please read the paper already presented at the conference ACM REP '24 (Second ACM Conference on Reproducibility and Replicability). For a short previous blog post, see here.
The paper is co-authored by Ludovic Courtès, Timothy Sample and Stefano Zacchiroli. And for the work under the hood that makes it Just Work™, special thanks to three Antoines:
- Antoine Eiche for the first implementation of sources.json and the associated loader;
- Antoine R. Dumont for reworking the loader;
- Antoine Lambert for helping to cross the finish line.
What's Guix? Guix is a software deployment tool that supports reproducible software deployment. As research results are increasingly the outcome of computational processes, software plays a central role. The ability to verify research results and to experiment with methodologies, core tenets of the scientific method, requires reproducible software deployment.
What's Software Heritage? Software Heritage is a long term, non-profit, multistakeholder initiative with the ambitious goal to collect, preserve and share all source code publicly available. To our knowledge, Software Heritage is the largest publicly available archive of software source code.
Missing context about Software Heritage? Feel free to have a look at my personal questions/answers from a community session held over the past year.
Could we connect Guix with Software Heritage? Yes! It makes Guix the first free software distribution and tool backed by Software Heritage, to our knowledge.
Ok, and what’s the key point? The key is content addressing! Both Guix and Software Heritage rely on “intrinsic identifiers” to identify source code. Please use[1] intrinsic identifiers instead of version labels.
Software Heritage relies on the SWHID identifier format. The key component here is the Merkle tree. Long story short, the first[2] version (swh:1:) of the SWHID format specifies two parts: 1. an object-type (cnt for a content, dir for a directory, etc.), where each object-type matches a node-type of the Merkle tree, and 2. the hash checksum of the object itself.
As a regular day-to-day user of Software Heritage and/or Guix, you do not need to dive into all the details below, although a technical overview sometimes helps in grasping the Big Picture. Last but not least, please provide context qualifiers using origin, visit and anchor when referencing or citing source code with a SWHID.
So far, so good. Now the plumbing! The first example: a plain file. Let’s run this command,
$ guix hash -S git -f hex -H sha1 COPYING
94a9ed024d3859793618152ea559a168bbcbb5e2
and then browse:
https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
Yeah, GNU General Public License version 3.
What does the Guix command mean?
- -S git: read the data the same way Git does; in short, it prepends a small header:
  $ (echo -en "blob $(cat COPYING | wc -c)\0" ; cat COPYING) | sha1sum
  94a9ed024d3859793618152ea559a168bbcbb5e2 -
  Warning: it does not matter whether the file is tracked by Git or not; the data is only read the same way Git reads it, whatever the version control system is, or even if there is none.
- -f hex: hexadecimal format. Warning: the format does not matter; it’s easy to switch from hexadecimal to base64 or whatever else.
- -H sha1: the hash algorithm used to compute the checksum.
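The same header trick can be sketched in a few lines of Python, using only the standard library. The function name git_blob_sha1 is made up for this post; the logic simply mirrors the shell pipeline: the word blob, the size in bytes, a NUL byte, then the content.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Hash DATA the way Git hashes a blob: header, NUL byte, then content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


# The empty blob has a well-known identifier in every Git repository.
print(git_blob_sha1(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

To check a file such as COPYING, read it in binary mode, pass the bytes to the function, and prefix the result with swh:1:cnt: to get the SWHID.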
Another example? Let’s run the command against the folder containing all the Guix packages.
$ guix hash -S git -f hex -H sha1 gnu/packages
ae2b06330a2086afc55671ce929ceabe1bf57441
and then browse:
https://archive.softwareheritage.org/swh:1:dir:ae2b06330a2086afc55671ce929ceabe1bf57441
Yeah, all the Guix packaging recipes.
Be careful: the first example uses the Software Heritage object-type cnt because it is just one file, whereas the second example uses the Software Heritage object-type dir because it is a directory. Note: once you have followed the Software Heritage hyperlink, you will find “Permalink” on the right side; click on it and you will see the object-type.
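For the curious, the directory case can be sketched in Python too. The sketch below follows the layout of a Git tree object (one entry per file: mode, name, NUL byte, raw hash of the child; entries sorted by name). The function names are made up for this post, and I have not compared it with guix hash on corner cases such as special files or version control folders, so take it as an illustration of the Merkle tree idea, not as a replacement for the real tools.

```python
import hashlib
import os
import stat
import tempfile


def git_object_id(kind: bytes, payload: bytes) -> bytes:
    """Raw 20-byte SHA-1 of a Git object: '<kind> <size>', NUL, then payload."""
    return hashlib.sha1(kind + b" %d\0" % len(payload) + payload).digest()


def git_tree_id(path: str) -> bytes:
    """Identifier of the directory at PATH, computed bottom-up (Merkle tree)."""
    entries = []
    for name in os.listdir(path):
        full = os.path.join(path, name)
        info = os.lstat(full)
        if stat.S_ISLNK(info.st_mode):
            # A symbolic link is stored as a blob holding its target.
            mode = b"120000"
            oid = git_object_id(b"blob", os.fsencode(os.readlink(full)))
        elif stat.S_ISDIR(info.st_mode):
            mode = b"40000"
            oid = git_tree_id(full)
        else:
            mode = b"100755" if info.st_mode & stat.S_IXUSR else b"100644"
            with open(full, "rb") as handle:
                oid = git_object_id(b"blob", handle.read())
        raw_name = os.fsencode(name)
        # Git sorts entries by name, comparing directories as if they ended with "/".
        sort_key = raw_name + b"/" if mode == b"40000" else raw_name
        entries.append((sort_key, mode + b" " + raw_name + b"\0" + oid))
    payload = b"".join(entry for _, entry in sorted(entries))
    return git_object_id(b"tree", payload)


with tempfile.TemporaryDirectory() as empty:
    # Git's well-known identifier of the empty tree.
    print(git_tree_id(empty).hex())  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Changing a single byte in any file changes the hash of that file, hence of its parent directory, and so on up to the root: that is what makes the identifier intrinsic.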
The “intrinsic identifier” is all about that: it provides the address of the content. If you have at hand this identifier, then it’s easy to fetch back the content and verify it matches the identifier.
So what is complicated? Ah maybe you know xkcd 927: Situation, There are 14 competing standards. 14?! Ridiculous! We need to develop one universal standard that covers everyone’s use cases. Situation, There are 15 competing standards.
Guix does not rely on the SWHID standard; instead, Guix identifies source code with the SHA-256 hash of its Normalized ARchive (nar) serialization, printed in Nix-base32. The specification of NAR is detailed in Eelco Dolstra’s PhD thesis (see figure 5.2, page 93). Well, Guix predates Software Heritage, and switching from “NAR Nix-base32 SHA-256” to “Git Hex SHA-1” is not affordable at this scale. Moreover, Guix uses this identifier as an integrity checksum, and we know that SHA-1 is vulnerable to collision attacks. Well, the two kinds of addresses cover two different needs.
Concretely? Let’s fetch the source code of a project that uses Mercurial as its version control system.
$ guix hash -S nar -f nix-base32 -H sha256 \
$(guix build --source hg-commitsigs)
059gm66q06m6ayl4brsc517zkw3ahmz249b6xm1m32ac5y24wb9x
And compare with the Guix recipe of the package.
(source
  (origin
    (method hg-fetch)
    (uri (hg-reference
          (url "https://foss.heptapod.net/mercurial/commitsigs")
          (changeset changeset)))
    (sha256
     (base32
      "059gm66q06m6ayl4brsc517zkw3ahmz249b6xm1m32ac5y24wb9x"))))
Ah?! And Software Heritage (“Git Hex SHA-1”) reads,
$ guix hash -S git -f hex -H sha1 \
$(guix build --source hg-commitsigs)
bfcd0414b593db8c1ebac1b489556db91197abd7
Doom? No, we need a bridge! And that’s what has been built: Software Heritage attaches an external identifier as extra metadata. Then, when fed with some “NAR SHA-256” identifier, a dedicated entry point of Software Heritage resolves it to a SWHID (“Git Hex SHA-1”). Cool, isn’t it?
As we said above, the format (hex or nix-base32) does not matter: it’s straightforward to convert from one to the other. Well, Python’s standard library implements the base32 format (and more!) but not the exotic nix-base32 format. Yeah, a Python implementation would be nice; hey, next time, for another post.
The conversion between formats is a simple composition: from nix-base32 to an internal representation (bytevector), then to hex (also named base16). Nothing more is needed than the Scheme file hex-of-Nix-base32.scm.
(use-modules ((guix base32) #:select (nix-base32-string->bytevector))
             ((guix base16) #:select (bytevector->base16-string))
             (ice-9 match))

;; Expect exactly one argument: the hash in nix-base32 format.
(match (command-line)
  ((_ hash)
   ;; nix-base32 string -> bytevector -> hexadecimal string
   (display (bytevector->base16-string
             (nix-base32-string->bytevector hash))))
  (_
   (display "Wrong number of arguments")))
(newline)
Let’s use this Scheme script to convert the hash from nix-base32 to hex. Nothing fancy!
$ guix repl -- \
hex-of-Nix-base32.scm 059gm66q06m6ayl4brsc517zkw3ahmz249b6xm1m32ac5y24wb9x
3d2d4e842f4c895143ed6625227e856af0f94f284ce745a857a61a808da92f15
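While waiting for a proper Python implementation, here is a rough sketch of the decoding direction only. It assumes the nix-base32 alphabet (digits and lowercase letters without e, o, t and u) and the bit layout used by Nix, where the last character of the string carries the least significant bits of the first byte. The function names are made up for this post.

```python
NIX_BASE32_ALPHABET = "0123456789abcdfghijklmnpqrsvwxyz"  # no e, o, t, u


def nix_base32_to_bytes(text: str) -> bytes:
    """Decode a nix-base32 string into the raw bytes of the hash."""
    number = 0
    for char in text:
        # str.index raises ValueError on a character outside the alphabet.
        number = number * 32 + NIX_BASE32_ALPHABET.index(char)
    size = len(text) * 5 // 8  # 52 characters -> 32 bytes (SHA-256)
    # The digits read left to right form a big number whose bytes are
    # stored least significant first.
    return number.to_bytes(size, "little")


def nix_base32_to_hex(text: str) -> str:
    return nix_base32_to_bytes(text).hex()


print(nix_base32_to_hex("059gm66q06m6ayl4brsc517zkw3ahmz249b6xm1m32ac5y24wb9x"))
```

The printed value should be the same hexadecimal string as the one returned by the Scheme script.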
And now, let’s query Software Heritage with this “NAR Hex SHA-256” address in the hex format.
$ ID=3d2d4e842f4c895143ed6625227e856af0f94f284ce745a857a61a808da92f15
$ curl -s "https://archive.softwareheritage.org/api/1/extid/nar-sha256/hex:${ID}/?extid_version=1" | jq
{
  "extid": "3d2d4e842f4c895143ed6625227e856af0f94f284ce745a857a61a808da92f15",
  "extid_type": "nar-sha256",
  "extid_version": 1,
  "target": "swh:1:dir:13d12be3b269428bb69d2b3bb77faf8f4524ac86",
  "target_url": "https://archive.softwareheritage.org/swh:1:dir:13d12be3b269428bb69d2b3bb77faf8f4524ac86"
}
And bang[3]! We get back the expected SWHID (“Git Hex SHA-1”). Awesome!
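The same query can be scripted with the Python standard library. This is only a sketch: the URL pattern and the target field are the ones visible in the curl example, the function names are made up for this post, and there is no error handling (an unknown identifier or a rate limit would simply raise an exception).

```python
import json
import urllib.request

API_ROOT = "https://archive.softwareheritage.org/api/1"


def extid_url(nar_sha256_hex: str) -> str:
    """URL of the entry point resolving a NAR SHA-256 (hex) identifier."""
    return f"{API_ROOT}/extid/nar-sha256/hex:{nar_sha256_hex}/?extid_version=1"


def target_swhid(response_text: str) -> str:
    """Extract the SWHID from the JSON document returned by the archive."""
    return json.loads(response_text)["target"]


def resolve(nar_sha256_hex: str) -> str:
    """Ask Software Heritage for the SWHID (network access required)."""
    with urllib.request.urlopen(extid_url(nar_sha256_hex)) as response:
        return target_swhid(response.read().decode("utf-8"))
```

Calling resolve with the hexadecimal hash computed earlier should return the same swh:1:dir: identifier as the curl command, provided the archive is reachable.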
Opening remarks
In summary, Guix feeds Software Heritage by two means:
- guix lint -c archival: It sends a “Save” request to Software Heritage for the package at hand. Be careful: for now, it only works when the package fetches its source code (origin) using Git.
- sources.json: It lists all[4] Guix packages and Software Heritage loads them. Note: Software Heritage re-computes the various hashes and uses the provided ones as a double-check.
Once the source code disappears[5], Guix queries Software Heritage and then Software Heritage cooks the content. It works most of the time but it’s not yet bulletproof.
Still there? Maybe you’re waiting for an answer to: How do you deal with compressed tarballs? The short answer: Disarchive. A longer answer: Check out the talk. An even longer answer: Check out the paper.
Join the fun, join Guix! Join Software Heritage!
Footnotes:
[1] Yes, this is Zooko’s triangle, but here being human-readable appears less important than being secure and decentralized.
[2] Yes, it’s very close to how Git is implemented! :-) See Git internals.
[3] The attentive reader will notice the discrepancy between bfcd04 and 13d12b. Finding out why is left as an exercise. Hint: how is the folder .hg dealt with?
[4] All Guix packages: all the packages that come with Guix proper. For now, channels are not considered.
[5] Link rot: ~10% after 5 years.