Guix: intrinsic vs extrinsic identifier: toward more robustness?
This is a slightly edited email sent to guix-devel@gnu.org. The idea is to write down my understanding and store it for later reference. It could also be the basis for another post on the same topic.
Hi,
I would like to open a discussion about how we identify source origins (fixed outputs). It is vitally important for being robust in the long term (say 3-5 years). It matters in the context of Reproducible Research, but not only there.
First thing first
What is an intrinsic identifier or an extrinsic one?
- extrinsic: uses a register to keep the correspondence between the identifier and the object; say, a version label such as a Git tag.
- intrinsic: intimately bound to the designated object itself; say, a hash such as a Git blob or tree and, to some extent, a commit.
The register must be a trusted authority, and it resolves by mapping the key identifier to the object. Having the object at hand does not give any clue about the key identifier. And collisions are very frequent: two key identifiers resolve to the same content (hopefully!), and we call that mirrors. ;-)
An intrinsic identifier also relies on a (trusted) map, but collisions are avoided as much as possible. This strongly reduces the power of the authority, and it is often more robust.
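A minimal sketch of the difference, under stated assumptions: the register is a plain Python dict, the label and the contents are made up, and SHA-1 is picked only for illustration. It shows that a label can stay the same while the object behind it changes, whereas an identifier computed from the object changes with it.

```python
import hashlib


def intrinsic_id(content: bytes) -> str:
    """Identifier computed from the object itself (SHA-1, for illustration only)."""
    return hashlib.sha1(content).hexdigest()


# Extrinsic: a register (here a dict) maps a label to an object.
# The label and the contents are hypothetical.
register = {"v1.0": b"first release\n"}
before = register["v1.0"]

# The authority behind the register re-assigns the same label.
register["v1.0"] = b"re-tagged release\n"
after = register["v1.0"]

# The label still resolves, but nothing in it reveals the change;
# the intrinsic identifiers of the two objects differ.
label_unchanged = "v1.0" in register
content_changed = intrinsic_id(before) != intrinsic_id(after)
print(label_unchanged, content_changed)
```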
Please note that identification and integrity are not the same thing. Since intrinsic identifiers often rely on cryptographic hash functions, and integrity checks do too, the two are easily confused.
Whatever intrinsic identifier we consider, even one based on a very weak cryptographic hash function such as MD5, or on a non-cryptographic hash function such as Pearson hashing, the integrity check is currently done with SHA256.
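To make the separation concrete, here is a small sketch, not Guix code: the object is located through a deliberately weak intrinsic identifier (MD5), and its integrity is then checked independently with SHA256. The store (a dict) and the byte string are hypothetical.

```python
import hashlib


def identify(content: bytes) -> str:
    """Weak intrinsic identifier, used only to locate the object."""
    return hashlib.md5(content).hexdigest()


def fetch(store: dict, identifier: str, expected_sha256: str) -> bytes:
    """Locate the object by identifier, then check integrity with SHA256."""
    content = store[identifier]
    if hashlib.sha256(content).hexdigest() != expected_sha256:
        raise ValueError("integrity check failed")
    return content


# Hypothetical content-addressed store holding a single object.
source = b"some source tarball bytes"
store = {identify(source): source}
expected = hashlib.sha256(source).hexdigest()

print(fetch(store, identify(source), expected) == source)
```

The identifier answers "which object?", the SHA256 answers "is it the object I expected?"; a weak hash for the former does not weaken the latter.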
For example, consider this source origin,
    (source (origin
              (method url-fetch)
              (uri (string-append "mirror://gnu/hello/hello-" version
                                  ".tar.gz"))
              (sha256
               (base32
                "086vqwk2wl8zfs47sq2xpjc9k066ilmb8z6dn0q6ymwjzlm196cd"))))

where mirror://gnu is resolved by Guix itself. Or this one,
    (source (origin
              (method git-fetch)
              (uri (git-reference
                    (url "https://github.com/FluxML/Zygote.jl")
                    (commit (string-append "v" version))))
              (file-name (git-file-name name version))
              (sha256
               (base32
                "02bgj6m1j25sm3pa5sgmds706qpxk1qsbm0s2j3rjlrz9xn7glgk"))))

where Guix clones the repository and then checks out what the commit field specifies.
Here both are extrinsic identifiers. For the first example, the register is defined by %mirrors. For the second example, the register is the .git/ folder.
An intrinsic identifier could be a plain hash or a hash of serialized data. Using Guix b8f6ead:
    $ guix hash -S none -H sha256 -f nix-base32 -x $(guix build hello -S)
    086vqwk2wl8zfs47sq2xpjc9k066ilmb8z6dn0q6ymwjzlm196cd

    $ guix hash -S git -H sha256 -f nix-base32 -x $(guix build hello -S)
    11kaw6m19rdj3d55y4cygk6k9zv6sn2iz4gpimx0j99ps87ij29l

    $ guix hash -S nar -H sha256 -f nix-base32 -x /gnu/store/3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar.gz
    1lvqpbk2k1sb39z8jfxixf7p7v8sj4z6mmpa44nnmff3w1y6h8lh
Or some Git-like tree MD5 of the decompressed data, e.g.,

    $ guix hash -S git -H md5 -f hex -x hello-2.12.1
    3db60bcfecf17a5dd81e3fb5bfb1c191
Or some others.
    $ git clone https://github.com/FluxML/Zygote.jl
    $ git -C Zygote.jl checkout v0.6.41

    $ guix hash -S nar -H sha256 -f nix-base32 -x Zygote.jl
    02bgj6m1j25sm3pa5sgmds706qpxk1qsbm0s2j3rjlrz9xn7glgk

    $ guix hash -S git -H sha1 -f hex -x Zygote.jl
    3cfdb31b517eec4173584fba2b1aa65daad46e09
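The -f option only changes how the same digest bytes are printed. As an illustration, here is a sketch of the nix-base32 rendering in Python; the function name is mine, and the algorithm is written from my understanding of the encoding used by Nix and Guix rather than copied from their sources. The hexadecimal digest in the example is the plain SHA256 of the hello tarball as reported later in this post.

```python
# Alphabet used by nix-base32: digits and lowercase letters without e, o, t, u.
NIX_BASE32_ALPHABET = "0123456789abcdfghijklmnpqrsvwxyz"


def nix_base32(digest: bytes) -> str:
    """Render raw digest bytes in nix-base32.

    The digest is read as a little-endian integer and cut into 5-bit
    groups, emitted from the most significant group to the least.
    """
    value = int.from_bytes(digest, "little")
    length = (len(digest) * 8 - 1) // 5 + 1
    return "".join(
        NIX_BASE32_ALPHABET[(value >> (5 * k)) & 0x1F]
        for k in range(length - 1, -1, -1)
    )


hex_digest = "8d99142afd92576f30b0cd7cb42a8dc6809998bc5d607d88761f512e26c7db20"
print(nix_base32(bytes.fromhex(hex_digest)))
# → 086vqwk2wl8zfs47sq2xpjc9k066ilmb8z6dn0q6ymwjzlm196cd
```

The printed string is the one found in the sha256 field of the hello origin.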
Second thing second
All that said, Guix uses extrinsic identifiers for almost all origins, if not all. Even for the git-fetch method.
Consider that GitHub disappears and that the default build farms ci.guix and bordeaux.guix are unreachable for whatever reason. Then Guix will fall back to Software Heritage and will exploit its resolver.
    Initialized empty Git repository in /gnu/store/ns1f3b4wm5n470bczd2k5li6xpgbqkz7-julia-zygote-0.6.41-checkout/.git/
    fatal: unable to access 'https://github.com/FluxML/Zygote.jl/': Could not resolve host: github.com
    Failed to do a shallow fetch; retrying a full fetch...
    fatal: unable to access 'https://github.com/FluxML/Zygote.jl/': Could not resolve host: github.com
    git-fetch: '/gnu/store/55ba5ragbd5sd4r45n0q24vrxx9rigrm-git-minimal-2.39.1/bin/git fetch origin' failed with exit code 128
    Trying content-addressed mirror at berlin.guix.gnu.org...
    Trying content-addressed mirror at berlin.guix.gnu.org...
    Trying to download from Software Heritage...
    SWH: found revision 4777767737b4c95d2cea842933c5b2edae2771b2 with directory at 'https://archive.softwareheritage.org/api/1/directory/3cfdb31b517eec4173584fba2b1aa65daad46e09/'
    swh:1:dir:3cfdb31b517eec4173584fba2b1aa65daad46e09/
It is SWH that finds the revision 4777767737b4c95d2cea842933c5b2edae2771b2 from the contextual information (URL + version label), and from this revision SWH associates the content having the intrinsic identifier swh:1:dir:3cfdb31b517eec4173584fba2b1aa65daad46e09.
First, please note that the SWHID is just Git,
    $ guix hash -S git -H sha1 -f hex \
        /gnu/store/ns1f3b4wm5n470bczd2k5li6xpgbqkz7-julia-zygote-0.6.41-checkout
    3cfdb31b517eec4173584fba2b1aa65daad46e09
In other words, the SWH information is essentially the same as that of the Git objects. Specifically, from the Git checkout,
    $ git cat-file -p v0.6.41
    object 4777767737b4c95d2cea842933c5b2edae2771b2
    type commit
    tag v0.6.41

    $ git cat-file -p 4777767737b4c95d2cea842933c5b2edae2771b2
    tree 3cfdb31b517eec4173584fba2b1aa65daad46e09
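To illustrate why the directory SWHID and the Git tree hash coincide, here is a sketch that computes a Git-style tree identifier for a directory on disk using only the Python standard library. It follows Git's object format (a "blob"/"tree" header followed by the body, hashed with SHA-1); the function names are mine. Two simplifications are assumptions of this sketch: the .git folder is skipped at every level, and empty directories are kept, which plain Git would not track.

```python
import hashlib
import os
import stat


def git_object_id(kind: bytes, body: bytes) -> bytes:
    """SHA-1 of a Git object: '<kind> <size>\\0' followed by the body."""
    header = kind + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).digest()


def tree_id(path: str, exclude=(".git",)) -> bytes:
    """Git-style tree identifier (raw 20 bytes) of the directory at path."""
    entries = []
    for name in os.listdir(path):
        if name in exclude:
            continue
        full = os.path.join(path, name)
        st = os.lstat(full)
        if stat.S_ISLNK(st.st_mode):
            # A symlink is stored as a blob holding its target.
            mode = b"120000"
            oid = git_object_id(b"blob", os.fsencode(os.readlink(full)))
        elif stat.S_ISDIR(st.st_mode):
            mode = b"40000"
            oid = tree_id(full, exclude)
        else:
            mode = b"100755" if st.st_mode & stat.S_IXUSR else b"100644"
            with open(full, "rb") as f:
                oid = git_object_id(b"blob", f.read())
        raw_name = os.fsencode(name)
        # Git sorts entries as if directory names ended with "/".
        sort_key = raw_name + (b"/" if mode == b"40000" else b"")
        entries.append((sort_key, mode, raw_name, oid))
    body = b"".join(m + b" " + n + b"\0" + o for _, m, n, o in sorted(entries))
    return git_object_id(b"tree", body)


def swhid_dir(path: str) -> str:
    """Directory SWHID: the prefix 'swh:1:dir:' plus the tree identifier in hex."""
    return "swh:1:dir:" + tree_id(path).hex()
```

The expectation, not verified here, is that swhid_dir("Zygote.jl") on the v0.6.41 checkout gives the identifier reported by SWH above.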
Second, SWH acts as a resolver here, i.e.,
    (find (lambda (branch)
            (or
             ;; Git specific.
             (string=? (string-append "refs/tags/" tag)
                       (branch-name branch))
             ;; Hg specific.
             (string=? tag
                       (branch-name branch))))
          (snapshot-branches snapshot))
and this is not robust. For one, it fails for Git lightweight tags, as exposed by the package open-zwave with its tag 1.6.
    $ for t in $(git tag); do printf "$t "; git cat-file -t $t ;done
    Rel-1.0 commit
    V1.5 tag
    v1.2 commit
    v1.3 tag
    v1.4 tag
    v1.6 commit
It means that the code above would be able to find V1.5 or v1.4 but not v1.6 or v1.2. Well, we can consider that a bug and improve the snapshot machinery so that it also collects more refs. But, for two…
…the current code (guix swh) does not deal with several snapshots and only considers the latest one. Therefore, it fails for some in-place replacements: upstream tags a specific revision, later removes that tag, and then reuses the same tag label for another revision (booo!). If SWH ingests after the first tag, SWH creates one snapshot; if SWH ingests again after the re-tag, SWH creates another snapshot.
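A sketch of the re-tag situation, with hypothetical data: the Snapshot class, its fields and the revision names below are made up for illustration and are not the SWH API nor the (guix swh) records. It contrasts looking at the latest snapshot only with walking through all of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical, simplified view of one archived visit."""
    date: str        # ISO date of the visit
    branches: dict   # ref name -> revision identifier


def latest_revision(snapshots, tag):
    """What a resolver sees if it only considers the latest snapshot."""
    latest = max(snapshots, key=lambda s: s.date)
    return latest.branches.get("refs/tags/" + tag)


def revisions_for_tag(snapshots, tag):
    """Every distinct revision the tag pointed to, oldest visit first."""
    ref = "refs/tags/" + tag
    seen = []
    for snap in sorted(snapshots, key=lambda s: s.date):
        rev = snap.branches.get(ref)
        if rev is not None and rev not in seen:
            seen.append(rev)
    return seen


# Upstream tagged revision "rev-A" as v1.0, then re-used v1.0 for "rev-B".
visits = [
    Snapshot("2022-01-01", {"refs/tags/v1.0": "rev-A"}),
    Snapshot("2023-01-01", {"refs/tags/v1.0": "rev-B"}),
]
print(latest_revision(visits, "v1.0"))    # rev-B
print(revisions_for_tag(visits, "v1.0"))  # ['rev-A', 'rev-B']
```

With the extrinsic label alone there is no way to tell which of the two revisions a package definition meant; an intrinsic identifier recorded at packaging time would disambiguate.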
Third, Disarchive is helping.
That said, adding a layer to maintain does not help when speaking about the long term (3-5 years); reducing the number of layers is often better for the long term. There is, however, work in progress to get Disarchive features directly from SWH.
What does Disarchive do? It maps various intrinsic identifiers.
Remember hello from above?
    $ guix shell disarchive guile-lzma guile
    $ disarchive disassemble hello-2.12.1
    (disarchive
      (version 0)
      (directory-ref
        (version 0)
        (name "hello-2.12.1")
        (addresses
          (swhid "swh:1:dir:ad5fc7c3062e8426b7936588e7a27d51ace0e508"))
        (digest
          (sha256 "cc7d5c45cfa1f5fba96c8b32d933734b24377a3c1ac776650044e497469affd4"))))

    $ guix hash -S git -H sha1 -f hex hello-2.12.1
    ad5fc7c3062e8426b7936588e7a27d51ace0e508

    $ guix hash -S git -H sha256 -f hex hello-2.12.1
    cc7d5c45cfa1f5fba96c8b32d933734b24377a3c1ac776650044e497469affd4
Well, the fixed output is a compressed tarball; it reads,
    $ disarchive disassemble $(guix build -S hello)
    (disarchive
      (gzip-member
        (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar.gz")
        (digest
          (sha256 "8d99142afd92576f30b0cd7cb42a8dc6809998bc5d607d88761f512e26c7db20"))
        (header (mtime 0) (extra-flags 2) (os 3))
        (footer (crc 2707092614) (isize 4945920))
        (compressor gnu-best-rsync)
        (input
          (tarball
            (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar")
            (digest
              (sha256 "a2c33fd13c555015433956bcf06609293a34ce5c5e6a2070990bfb86070dc554"))
            [...]
            (input
              (directory-ref
                (version 0)
                (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1")
                (addresses
                  (swhid "swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a"))
                (digest
                  (sha256 "1cb6effd40736b441a2a6dd49e56b3dfd4f6550e8ae1a8ac34ed4b1674097bc0"))))))))
where the values are just (considering that guix hash -S none -H sha256 -f hex is equivalent to sha256sum)
    $ guix hash -S none -H sha256 -f hex $(guix build hello -S)
    8d99142afd92576f30b0cd7cb42a8dc6809998bc5d607d88761f512e26c7db20

    $ gzip -d $(guix build -S hello) -c | sha256sum
    a2c33fd13c555015433956bcf06609293a34ce5c5e6a2070990bfb86070dc554  -
However, the swhid field and the other SHA256 digest are different from the ones above. That's because of the part hidden behind the dots […]. It probably comes from the normalization process. Well, I am not sure I deeply understand why they differ, but that's another story. :-)
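For readers without Guix at hand, the two digests of the compressed tarball (the gzip member and the tarball inside it) can be recomputed with the Python standard library. This is a sketch; the function names and the chunk size are mine.

```python
import gzip
import hashlib


def sha256_file(path: str) -> str:
    """SHA256 of the file as stored, like sha256sum FILE."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sha256_gunzipped(path: str) -> str:
    """SHA256 of the decompressed stream, like gzip -d FILE -c | sha256sum."""
    digest = hashlib.sha256()
    with gzip.open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Applied to the hello tarball, sha256_file should correspond to the gzip-member digest and sha256_gunzipped to the tarball digest.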
Fourth, it misses a bridge using NAR normalization (serialization).
Disarchive can (or could) provide a bridge (map) between SWHID+SHA1 and NAR+SHA256. But it would be nice if it were implemented in SWH directly. It would ease the previous drawbacks.
For the interested reader, discussion there. Moreover, this discussion provides simple examples about NAR and how to implement it using Python.
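To give an idea of what the NAR side of such a bridge involves, here is a compact sketch of the serialization in Python. It is written from my understanding of the NAR format (length-prefixed, 8-byte padded strings; the "nix-archive-1" magic; directory entries sorted by name) and is not the code from the discussion mentioned above; the function names are mine.

```python
import hashlib
import os
import stat
import struct


def _str(data: bytes) -> bytes:
    """NAR string: 64-bit little-endian length, bytes, zero padding to 8 bytes."""
    return struct.pack("<Q", len(data)) + data + b"\0" * (-len(data) % 8)


def _node(path: str) -> bytes:
    """Serialize one file system object (regular file, symlink or directory)."""
    st = os.lstat(path)
    out = _str(b"(") + _str(b"type")
    if stat.S_ISLNK(st.st_mode):
        out += _str(b"symlink") + _str(b"target")
        out += _str(os.fsencode(os.readlink(path)))
    elif stat.S_ISDIR(st.st_mode):
        out += _str(b"directory")
        for name in sorted(os.listdir(path), key=os.fsencode):
            out += _str(b"entry") + _str(b"(")
            out += _str(b"name") + _str(os.fsencode(name))
            out += _str(b"node") + _node(os.path.join(path, name))
            out += _str(b")")
    else:
        out += _str(b"regular")
        if st.st_mode & stat.S_IXUSR:
            out += _str(b"executable") + _str(b"")
        with open(path, "rb") as f:
            out += _str(b"contents") + _str(f.read())
    return out + _str(b")")


def nar(path: str) -> bytes:
    """Whole archive: magic string followed by the root node."""
    return _str(b"nix-archive-1") + _node(path)


def nar_sha256(path: str) -> str:
    """Hex SHA256 of the NAR serialization of path."""
    return hashlib.sha256(nar(path)).hexdigest()
```

If the sketch is faithful, nar_sha256 on a checkout is the hexadecimal form of what guix hash -S nar -H sha256 prints; I have not checked it against the hashes of this post. Note that only the executable bit is kept from the file metadata, so timestamps and ownership do not influence the result.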
Discussion asking for comments and feedback
Still there? If yes, thanks for reading. :-)
As shown in,
- https://lists.gnu.org/archive/html/guix-devel/2023-02/msg00398.html
- https://lists.gnu.org/archive/html/guix-devel/2023-03/msg00007.html
we have holes and we are not currently robust for the long term (3-5 years) if our lovely build farms are down for whatever reason.
For sure, we have to fix the holes and bugs. :-) However, I am asking what we could add to be more robust in the long term.
It is neither affordable nor wanted to switch from the current extrinsic identification to a completely intrinsic one, although it would fix many issues. ;-)
Guix and guix time-machine provide all the machinery for being able to redeploy later but, as I have tried to point out in the two links above, we are lacking tools for retrieving contents; well, having the machinery does not mean that such machinery works well or is robust. :-)
The discussion could also cover how to distribute using ERIS.
At some point, I was thinking of having something like “guix freeze -m manifest.scm” that returns a map of all the sources, from the deep bootstrap to the leaf packages described in manifest.scm. However, maybe the metadata we collect at packaging time is too poor.
For instance, substitutes work more or less using intrinsic identifiers, so it helps, I guess. :-)
Well, we could imagine the addition of another optional field, say under properties, that could store the intrinsic identifier of the fixed outputs, such as the SWHID, the Git tree or commit hash, or something else. It would add robustness for later.
Or maybe an optional field of the origin record for the same purpose.
WDYT?
Cheers, simon
Any private feedback by email is very welcome. :-)
Join the fun, join Guix!