Back from FOSDEM: Guix + Software Heritage = ♥

Wow, I am still recovering from these intensive days! It was such an enjoyable experience. So refreshing to team up with people from very diverse backgrounds and to work together, mixing ideas, feedback, and experience. It energizes for the rest of the year.

Yeah, a packed week: a working group meeting of the French network for Reproducible Research about reproducible computational environments on Monday, the talk « Guix: a factory for container? » as part of Café Guix on Tuesday, the Software Heritage Symposium at UNESCO on Wednesday, a Software Heritage enthusiasts session on Thursday, Guix Day on Friday, a talk about Guix and Software Heritage at FOSDEM, and… oops, I skipped the Declarative and Minimalistic Computing devroom on Sunday and instead enjoyed the nice Brussels sun and visited a friend. Yeah, the good kind of packed week: the one that energizes for the rest of the year!

Special thanks to the Software Heritage team for all the organization, to Tanguy and Pjotr for making the Guix Days possible, and to the Open Research devroom people for this very devroom. It is great to have such an opportunity booster!

In the talk « Guix + Software Heritage: Source Code Archiving to the Rescue of Reproducible Deployment », the Too Long; Don’t Watch message is:

Reproducible research and your future self will thank you for both of these choices. Bet?

In this post, I would like to highlight a couple of points that were presented in this talk.

For more details, please read the paper presented at ACM REP '24 (the Second ACM Conference on Reproducibility and Replicability). For a short earlier blog post, see here.

Paper: Source Code Archiving to the Rescue of Reproducible Deployment

The paper is co-authored by Ludovic Courtès, Timothy Sample and Stefano Zacchiroli. And for making it Just Work™ under the hood, special thanks to three Antoines:

What's Guix? Guix is a software deployment tool that supports reproducible software deployment. As research results are increasingly the outcome of computational processes, software plays a central role. The ability to verify research results and to experiment with methodologies, core tenets of the scientific method, requires reproducible software deployment.

What's Software Heritage? Software Heritage is a long term, non-profit, multistakeholder initiative with the ambitious goal to collect, preserve and share all source code publicly available. To our knowledge, Software Heritage is the largest publicly available archive of software source code.

If you are missing context about Software Heritage, feel free to have a look at my personal questions/answers from a community session of the past year.

Could we connect Guix with Software Heritage? Yes! It makes Guix the first free software distribution and tool backed by Software Heritage, to our knowledge.

Ok, and what’s the key point? The key is content addressing! Both Guix and Software Heritage rely on an “intrinsic identifier” for identifying source code. Please use[1] intrinsic identifiers instead of version labels.

Software Heritage relies on identifiers in the SWHID format. The key component here is the Merkle tree. Long story short, the first[2] version (swh:1:) of the SWHID format specifies two parts: 1. an object-type (cnt for a content, dir for a directory, etc.), where each object-type matches a node-type of the Merkle tree, and 2. the hash checksum of the object itself.

As a regular day-to-day user of Software Heritage and/or Guix, you do not need to dive into all the details below, although a technical overview sometimes helps in grasping the Big Picture. Last but not least, please provide context qualifiers using origin, visit and anchor when referencing or citing source code with a SWHID.
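To make the qualifier advice concrete, here is a minimal Python sketch that appends origin, visit and anchor to a core SWHID. The function names are mine, and the origin URL and the all-zero snp/rev identifiers are placeholders, not real archive entries. The ordering and the percent-escaping of “;” follow my reading of the SWHID specification, so please double-check against it before relying on this.

```python
# Sketch: assemble a SWHID with context qualifiers (origin, visit, anchor).
# Function names are illustrative; the example values are placeholders.

# Order in which qualifiers are emitted (assumed canonical order).
QUALIFIER_ORDER = ("origin", "visit", "anchor", "path", "lines")


def escape(value: str) -> str:
    """Percent-encode the characters that would clash with the ';' separator."""
    return value.replace("%", "%25").replace(";", "%3B")


def qualified_swhid(core: str, **qualifiers: str) -> str:
    """Append the given qualifiers to the core SWHID, in a fixed order."""
    unknown = set(qualifiers) - set(QUALIFIER_ORDER)
    if unknown:
        raise ValueError(f"unknown qualifier(s): {sorted(unknown)}")
    parts = [core]
    for key in QUALIFIER_ORDER:
        if key in qualifiers:
            parts.append(f"{key}={escape(qualifiers[key])}")
    return ";".join(parts)


if __name__ == "__main__":
    print(
        qualified_swhid(
            "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2",
            origin="https://example.org/project.git",  # placeholder origin
            visit="swh:1:snp:" + "0" * 40,             # placeholder snapshot
            anchor="swh:1:rev:" + "0" * 40,            # placeholder revision
        )
    )
```

The core identifier alone says what the content is; the qualifiers say where and when it was seen, which is what a reader of a citation usually needs.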

So far, so good, now the plumbing! The first example: a plain file; let's run this command,

$ guix hash -S git -f hex -H sha1 COPYING
94a9ed024d3859793618152ea559a168bbcbb5e2

and then browse: https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2. Yeah, GNU General Public License version 3.

What does the Guix command mean?
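A hedged reading: -S git selects the Git-style serializer, -f hex prints the result in hexadecimal, and -H sha1 picks SHA-1 as the hash algorithm. For a single file, this should coincide with the way Git hashes a blob, which is also what the cnt object-type of SWHID uses. The Python sketch below reproduces that computation without Guix; the function names are mine, not part of any tool.

```python
# Sketch: compute the swh:1:cnt: identifier of a file, Git-blob style.
import hashlib
import sys


def git_blob_sha1(content: bytes) -> str:
    """SHA-1 over the header 'blob <size>', a NUL byte, then the raw content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def swhid_for_content(content: bytes) -> str:
    """Prefix the hash with the SWHID version 1 scheme and the cnt object-type."""
    return "swh:1:cnt:" + git_blob_sha1(content)


if __name__ == "__main__":
    # Usage (file name is up to you): python swhid_cnt.py COPYING
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(swhid_for_content(handle.read()), path)
```

Run against the same COPYING file, this should print the identifier shown above, since the hash depends only on the bytes of the file and not on its name or location.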

Another example? Let's run the command against the folder containing all the Guix packages.

$ guix hash -S git -f hex -H sha1 gnu/packages
ae2b06330a2086afc55671ce929ceabe1bf57441

and then browse: https://archive.softwareheritage.org/swh:1:dir:ae2b06330a2086afc55671ce929ceabe1bf57441. Yeah, all the Guix packaging recipes.

Be careful: the first example uses the Software Heritage object-type cnt because it is just one file, whereas the second example uses the object-type dir because it is a directory. Note that once you have followed the Software Heritage hyperlink, you will find “Permalink” on the right side; click it and you will see the object-type.
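To see how a dir identifier builds on cnt identifiers (the Merkle tree at work), here is a rough Python sketch that hashes an in-memory directory the way Git hashes a tree. It is a simplification under stated assumptions: only regular, non-executable files and sub-directories are handled (no symlinks, no executable bit), the directory is modelled as a nested dict, and the function names are mine.

```python
# Sketch: Git-style tree hashing of a nested dict {name: bytes | dict}.
# Assumption: regular non-executable files (mode 100644) and directories only.
import hashlib


def blob_id(content: bytes) -> bytes:
    """Raw 20-byte SHA-1 of a file, hashed as a Git blob."""
    return hashlib.sha1(b"blob %d\0" % len(content) + content).digest()


def tree_id(entries: dict) -> bytes:
    """Raw 20-byte SHA-1 of a directory, hashed as a Git tree."""
    records = []
    for name, value in entries.items():
        raw = name.encode()
        if isinstance(value, dict):
            # Directories sort as if their name ended with '/'.
            records.append((raw + b"/", b"40000", raw, tree_id(value)))
        else:
            records.append((raw, b"100644", raw, blob_id(value)))
    records.sort(key=lambda record: record[0])
    # Each entry: "<mode> <name>", a NUL byte, then the child's raw hash.
    body = b"".join(
        mode + b" " + raw + b"\0" + child for _, mode, raw, child in records
    )
    return hashlib.sha1(b"tree %d\0" % len(body) + body).digest()


def swhid_for_directory(entries: dict) -> str:
    return "swh:1:dir:" + tree_id(entries).hex()


if __name__ == "__main__":
    example = {"README": b"hi\n", "src": {"main.scm": b"(display 42)\n"}}
    print(swhid_for_directory(example))
```

Changing a single byte in any file changes the hash of that file, hence of its parent directory, and so on up to the root: that is why one short identifier can vouch for a whole tree.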

The “intrinsic identifier” is all about that: it provides the address of the content. If you have this identifier at hand, then it’s easy to fetch back the content and verify that it matches the identifier.

So what is complicated? Ah, maybe you know xkcd 927: “Situation: There are 14 competing standards.” “14?! Ridiculous! We need to develop one universal standard that covers everyone’s use cases.” “Situation: There are 15 competing standards.”

Guix does not rely on the SWHID standard; instead, Guix identifies source code with the SHA-256 hash of its Normalized ARchive (nar) serialization, printed in Nix-base32. The specification of NAR is detailed in Eelco Dolstra’s PhD thesis (see figure 5.2 page 93). Well, Guix predates Software Heritage, and switching from “NAR Nix-base32 SHA-256” to “Git Hex SHA-1” is not affordable at this scale. Moreover, Guix uses this identifier as an integrity checksum, and we know that SHA-1 is vulnerable to collision attacks. Well, the two kinds of addresses cover two different needs.

Concretely? Let's fetch the source code of some project that uses Mercurial as its version control system.

$ guix hash -S nar -f nix-base32 -H sha256 \
    $(guix build --source hg-commitsigs)
059gm66q06m6ayl4brsc517zkw3ahmz249b6xm1m32ac5y24wb9x

And compare with the Guix recipe of the package.

(source (origin
          (method hg-fetch)
          (uri (hg-reference
                (url "https://foss.heptapod.net/mercurial/commitsigs")
                (changeset changeset)))
          (sha256
           (base32
            "059gm66q06m6ayl4brsc517zkw3ahmz249b6xm1m32ac5y24wb9x"))))

Ah?! And Software Heritage (“Git Hex SHA-1”) reads,

$ guix hash -S git -f hex -H sha1 \
    $(guix build --source hg-commitsigs)
bfcd0414b593db8c1ebac1b489556db91197abd7

Doom? No, we need a bridge! And that’s what has been built: Software Heritage attaches an external identifier as extra metadata. So then, when a dedicated entry point is fed with some “NAR SHA-256” identifier, Software Heritage resolves it to a SWHID (“Git Hex SHA-1”). Cool, isn’t it?

As we said above, the format (hex or nix-base32) does not matter. It’s straightforward to transform from one to the other. Well, Python’s standard library implements the base32 format (and more!) but not the exotic nix-base32 format. Yeah, a Python implementation would be nice; hey, next time, for another post.

The conversion between formats is a simple composition: from nix-base32 to an internal representation (bytevector), then to hex (also named base16). There is nothing more than that in the Scheme file hex-of-Nix-base32.scm.

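;; Convert a nix-base32 string, given as the only argument, to hex (base16).
(use-modules ((guix base32) #:select (nix-base32-string->bytevector))
             ((guix base16) #:select (bytevector->base16-string))
             (ice-9 match))

(match (command-line)
  ;; Exactly one argument: decode nix-base32, then print it as hex.
  ((_ nix-base32)
   (display
    (bytevector->base16-string
     (nix-base32-string->bytevector
      nix-base32))))
  ;; Anything else: complain.
  (_
   (display "Wrong number of arguments")))
(newline)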

Let's use this Scheme script and convert the hash from nix-base32 to hex. Nothing fancy!

$ guix repl -- \
    hex-of-Nix-base32.scm 059gm66q06m6ayl4brsc517zkw3ahmz249b6xm1m32ac5y24wb9x
3d2d4e842f4c895143ed6625227e856af0f94f284ce745a857a61a808da92f15
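For readers without Guix at hand, the same conversion can be sketched in Python. This is my own reading of the nix-base32 encoding, not an official implementation: the bytes are taken as a little-endian integer, which is then written most significant digit first in base 32, with an alphabet that skips the letters e, o, t and u. The function names are illustrative.

```python
# Sketch: nix-base32 <-> hex conversion (assumed reading of the encoding).
NIX_BASE32_ALPHABET = "0123456789abcdfghijklmnpqrsvwxyz"  # no e, o, t, u


def nix_base32_to_hex(text: str) -> str:
    """Decode a nix-base32 string and return the bytes as hexadecimal."""
    number = 0
    for char in text:
        # str.index raises ValueError on characters outside the alphabet.
        number = number * 32 + NIX_BASE32_ALPHABET.index(char)
    size = len(text) * 5 // 8  # e.g. 52 characters -> 32 bytes
    # to_bytes raises OverflowError if the padding bits are not zero.
    return number.to_bytes(size, "little").hex()


def hex_to_nix_base32(hexstring: str) -> str:
    """Encode hexadecimal bytes as a nix-base32 string."""
    data = bytes.fromhex(hexstring)
    number = int.from_bytes(data, "little")
    length = (len(data) * 8 - 1) // 5 + 1  # e.g. 32 bytes -> 52 characters
    digits = []
    for _ in range(length):
        digits.append(NIX_BASE32_ALPHABET[number % 32])
        number //= 32
    return "".join(reversed(digits))


if __name__ == "__main__":
    print(nix_base32_to_hex("059gm66q06m6ayl4brsc517zkw3ahmz249b6xm1m32ac5y24wb9x"))
```

The little-endian twist is the reason the standard library's base32 helpers cannot simply be reused with a translated alphabet.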

And now, let's query Software Heritage with this “NAR SHA-256” address in the hex format.

$ ID=3d2d4e842f4c895143ed6625227e856af0f94f284ce745a857a61a808da92f15
$ curl -s "https://archive.softwareheritage.org/api/1/extid/nar-sha256/hex:${ID}/?extid_version=1" | jq
{
  "extid": "3d2d4e842f4c895143ed6625227e856af0f94f284ce745a857a61a808da92f15",
  "extid_type": "nar-sha256",
  "extid_version": 1,
  "target": "swh:1:dir:13d12be3b269428bb69d2b3bb77faf8f4524ac86",
  "target_url": "https://archive.softwareheritage.org/swh:1:dir:13d12be3b269428bb69d2b3bb77faf8f4524ac86"
}

And bang[3]! We get back the expected SWHID (“Git Hex SHA-1”). Awesome!
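The same lookup can be scripted. Here is a minimal Python sketch using only the standard library; it assumes the endpoint and the JSON fields exactly as shown in the curl output above, and the function names are mine. There is no error handling or rate-limit awareness, so take it as a starting point only.

```python
# Sketch: resolve a "NAR SHA-256" hash (hex) to a SWHID via the extid endpoint.
import json
import urllib.request

API = "https://archive.softwareheritage.org/api/1"


def extid_url(nar_sha256_hex: str, version: int = 1) -> str:
    """Build the lookup URL, mirroring the curl command above."""
    return f"{API}/extid/nar-sha256/hex:{nar_sha256_hex}/?extid_version={version}"


def resolve_nar_sha256(nar_sha256_hex: str) -> str:
    """Query Software Heritage and return the 'target' field of the answer."""
    with urllib.request.urlopen(extid_url(nar_sha256_hex)) as response:
        return json.load(response)["target"]


if __name__ == "__main__":
    # Only prints the URL; call resolve_nar_sha256() to hit the network.
    print(extid_url("3d2d4e842f4c895143ed6625227e856af0f94f284ce745a857a61a808da92f15"))
```

Calling resolve_nar_sha256 with the hash above should return the same target as the curl session, as long as the archive still holds that external identifier.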

Opening remarks

In summary, Guix feeds Software Heritage by two means:

  1. guix lint -c archival: It sends a “Save” query to Software Heritage for the package at hand. Be careful: for now, it only works when the package fetches its source code (origin) using Git.
  2. sources.json (link): It lists all[4] Guix packages and Software Heritage loads them. Note that Software Heritage recomputes the various hashes and uses the ones provided as a double-check.

Once the source code disappears[5], Guix queries Software Heritage and then Software Heritage cooks the content. It works most of the time, but it’s not yet bulletproof.

Still there? Maybe you’re waiting for an answer to: How do you deal with compressed tarballs? The short answer: Disarchive. A longer answer: Check out the talk. An even longer answer: Check out the paper.

Join the fun, join Guix! Join Software Heritage!

Footnotes:

[1] Yes, Zooko’s triangle; but human-readable appears less important here than secure and decentralized.

[2] Yes, it’s very close to how Git is implemented! :-) See Git internals.

[3] Attentive readers will notice the discrepancy between bfcd04 and 13d12b. Finding out why is left as an exercise. Hint: how to deal with the .hg folder?

[4] All Guix packages: all the packages coming with Guix proper. For now, channels are not considered.

[5] Link rot: ~10% after 5 years.


© 2014-2024 Simon Tournier <simon (at) tournier.info >

(last update: 2025-02-28 Fri 17:51)