Redoing our paper in Nature Scientific Data (Oct. 2022)

transparent, verifiable and long-term… one year later

A few days ago, I prepared a demo for my talk at the Pasteur Institute. Why not try out one example from the paper Toward practical transparent verifiable and long-term reproducible research using Guix, Nature Scientific Data, vol. 9, num. 597? Well, I spent the morning just before the talk checking that example for the demo… and the reproduction of the computational environment failed. This post investigates and draws counter-measures. The aim here is twofold:

  1. Expose the roadblocks to reproducing a computational environment; here, the reproduction of a computational environment three years distant.
  2. Show how the Guix project helps, from technical workarounds to work-in-progress features.

Our paper was published in October 2022 and we demoed it using a paper published in 2020. The Guix revision (channel file) – the one we published in 2022 – had thus been picked from an older Guix revision, mimicking a Guix channel file as it would have been published with the paper in 2020. To be precise, the selected Guix revision we published1 was 1971d11db9 (April 14th, 2020). And as explained in our paper, we chose this revision in order to get Bioconductor 3.10, as we extrapolated from the published material.

In other words, the command line,

guix time-machine -C channels.scm \
     -- environment -C -L my-pkgs -m manifest.scm

builds the exact same computational environment as it was in April 2020. This command line worked out of the box in October 2022. Now, more than one year later, this very same command line fails.

Long story short: today, three years after a publication, link rot defeats the ability to check this very published result. The failure starts very early in the reproduction attempt: it starts when getting all the source code composing the computational environment.

The devil is in the details. Let’s review them!

Disclaimer. “Re-run and compare” would be another goal. I think “re-run and compare” is a tangential corollary of the main objective: reproduce the exact same computational environment. Somehow, if we run inside the exact same computational environment, then the experimental conditions of the computations are (almost) bounded, and so “re-run and compare” is just repeating the same experiment once more, thus regenerating equivalent data. In addition, for many results, “re-run and compare” is not affordable because the cost (energy consumption, computing resource availability, CPU time, memory, etc.) is too high or not worth it. To me, the scientific method applied to the computational environment means its full source-to-binary transparency, in other words, the ability to deeply audit and verify everything, in the long term. That is the first main objective; then, if completed and if required, we could secondly apply variations to the computational environment in order to challenge the conclusion under study.

How to rebuild the past?

Before jumping into details, what do we mean by “computational environment” here? This computational environment is composed of the direct packages required by the analysis and also of all the indirect packages not explicitly listed. How many packages are we speaking about? More than 720! Dependencies of dependencies matter, perhaps.
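For the curious, here is one way to approximate that count yourself. This is a sketch: it assumes guix graph behaves the same at this old revision, and that counting the label attributes of the Graphviz output is a fair proxy for counting packages.

$ guix time-machine -C channels.scm \
       -- graph -t bag -L my-pkgs \
       r r-rtsne r-pheatmap r-rcolorbrewer r-ncdfflow r-edger \
       r-flowcore r-dplyr r-combinat r-rmarkdown r-cydar \
       | grep -c "label ="

The graph type -t bag includes the implicit inputs added by the build systems, which is where most of the surprising dependencies hide.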

Reproducing the exact same computational environment means that we must compose it again from more than 720 pieces of source code, today in 2023 exactly as in 2020. This raises two important questions:

  1. How to identify a piece of source code?
  2. How to build the identical computational environment composed of more than 720 components?

My opinionated answer is: a version label, as in “I used flowCore at version 1.52.1”, is not enough because,

  1. it identifies the source code poorly,
  2. it does not scale; more than 720 version labels would have to be provided.

Ready for the arcana of rebuilding the past? Let’s go!

First things first

The computational environment is described by the manifest file,

(specifications->manifest
 (list
  ;; Packages from Guix collection
  "r"
  "r-rtsne"
  "r-pheatmap"
  "r-rcolorbrewer"
  "r-ncdfflow"
  "r-edger"
  "r-flowcore"
  "r-dplyr"
  "r-combinat"
  "r-rmarkdown"                         ;render .Rmd files

  ;; Extend collection by defining in folder my-pkgs/my-pkgs.scm
  "r-cydar"
  ))

and all the versions of all these packages are implicitly defined by the channel file, which in this case reads,

(list (channel
        (name 'guix)
        (url "https://git.savannah.gnu.org/git/guix.git")
        (commit
          "1971d11db9ed9683d5036cd4c62deb564842e1f6")))

Here, only the “Guix channel” is required. More channels could also be added to extend the package collection; see the sketch below. In our case, the package collection was locally extended with one package defined in the directory my-pkgs.
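For instance – a purely illustrative sketch where the extra channel name and URL are made up – a channel file combining the pinned Guix with an additional channel would read,

(list (channel
        (name 'guix)
        (url "https://git.savannah.gnu.org/git/guix.git")
        (commit
          "1971d11db9ed9683d5036cd4c62deb564842e1f6"))
      (channel
        (name 'my-extra-packages)     ; hypothetical channel
        (url "https://example.org/my-extra-packages.git")
        (commit
          "0123456789abcdef0123456789abcdef01234567"))) ; pin a real commit here

Each channel is pinned by its own commit, so the extended collection stays frozen exactly like the Guix collection itself.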

Complete transparency

The commitment that the Guix project tries to demonstrate is, from my point of view, unique and a real challenge: provide tools that produce the same computational environment at two distant points, in time and in space. Concretely, running this command line,

$ guix time-machine -C channels.scm \
       -- environment -C -L my-pkgs -m manifest.scm

outputs this long list:

guile: warning: failed to install locale
substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
substitute: updating substitutes from 'https://bordeaux.guix.gnu.org'... 100.0%
The following derivations will be built:
   /gnu/store/q0bb9n6blh4nbs6icmzfasymcwr5wkgd-r-3.6.3.drv
   /gnu/store/d52csrci11ngcxmzp04n9mdpjn207h7r-r-mass-7.3-51.5.drv
   /gnu/store/4xk7sl860dhf134in2gzfkbw6hc9z3an-MASS_7.3-51.5.tar.gz.drv
   /gnu/store/hl17dqn0wfc7wdf4c30daqvd5zgh5bky-r-codetools-0.2-16.drv
   /gnu/store/5257ay3c3qkiy51yv4frsvmjmh3hxk08-codetools_0.2-16.tar.gz.drv
[...]
   /gnu/store/3sp4a864ax4cl8k7mpbmnxgbrvrcmvy8-gcc-7.4.0.drv
   /gnu/store/9b5swsrwd1z7lz6r9b1w3jdzyc75nvsx-ghc-8.0.2-src.tar.xz.drv
   /gnu/store/7k0qyy1s0clja7g1967ny8wsjlyy7izs-ghc-8.0.2.drv
   /gnu/store/03pbyq29ip4827h871y7p4bqsd8y0y1y-ghc-8.6.5.drv
   /gnu/store/31916d6jvgjwahvd28yipbpyrfrivmiq-ghc-pandoc-types-1.17.6.1.drv
   /gnu/store/2gwaaigspkzsa146ykfzdaingcx9kfjj-ghc-test-framework-hunit-0.3.0.2.drv
   /gnu/store/0h0c8pcbb34g8p6jxw09gw6km0frppd6-ghc-extensible-exceptions-0.1.1.4.drv
   /gnu/store/2srxxhp7dx4p286qk9rvrrywh4mkbgpy-ghc-pandoc-2.7.3.drv
[...]
   /gnu/store/rd9yb2pci9xsr71dhf1k90gnmqhd513i-clisp-2.49-92.drv
   /gnu/store/ahi2sb681pz13a9sfv2hd8r77a5rb88v-clisp-2.49-92.tar.xz.drv
   /gnu/store/b03g73xpi16dyh762r2s39l2bvd40vif-sbcl-parse-js-0.0.0-1.fbadc6029.drv
   /gnu/store/fbz1kkihs51pgzmhfvqfw2xwawlycvcn-sbcl-iterate-1.5.drv
   /gnu/store/gmcvf18cpkdadz6h53l88nlnmh81b1r1-sbcl-rt-1990.12.19-1.a6a7503.drv
[...]

building /gnu/store/y1bnskpk88qh1adw7hpvds125m35p8xp-r-minimal-3.6.3.drv...
'build' phase

What does it mean? It means that the computational environment is completely transparent and verifiable. Namely, these files ending in .drv describe how to build, or where to fetch, the source code. We have access to all the details for producing all the binary artifacts from the source code.

First, note that the Guix project provides pre-built substitutes, and these binary artifacts from 2020 are gone. In other words, we need to download the source code and build it locally. For instance, we see that we are building the package named r-minimal, and then we will download the source code of the R library MASS in order to build the corresponding Guix package r-mass. And so on.
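As an aside, guix weather reports substitute availability. Here is a sketch for probing a few of the packages – assuming the weather command at this revision accepts the same package specifications:

$ guix time-machine -C channels.scm \
       -- weather r-flowcore r-ncdfflow r-rmarkdown

A low percentage confirms that most binary artifacts will have to be rebuilt locally.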

Second, have you noticed these ghc items? They are Haskell libraries, required by Pandoc – a universal document converter. The Guix package ghc-pandoc is indirectly required by R libraries such as ncdfFlow. In addition, we can also see clisp and sbcl, which come from the Common Lisp ecosystem. Would you have guessed them beforehand? Not me! But I told you – more than 700 dependencies of dependencies.

Note. One might ask whether one of these dependencies of dependencies matters for the final result of the analysis, and it is a legitimate question. Without using Guix, I would not be able to start an answer – to confirm or contradict my intuition. In other words, the scientific method implies that we need to challenge the hypothesis – say, that it has no impact – by testing some variations. Therefore, we need two things: on one hand, a reference point, and on the other hand, the capacity to generate fully controlled variations of that reference point. It is what experimenters would do: run experiments varying carefully selected parameters under controlled conditions, and check the output of these experiments against a control output.

The unexpected failure?

Ah, the previous command line quickly fails. The error is about the R library BiocNeighbors. It is important to notice that this R library is not explicitly required by the analysis but appears as an indirect dependency – required by the package r-cydar, as the command below can confirm.
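How do we know? One way – assuming the --path option of guix graph is available at this revision; otherwise run it with a recent guix – is to ask for a dependency path between the two packages:

$ guix time-machine -C channels.scm \
       -- graph --path -L my-pkgs r-cydar r-biocneighbors

This prints the chain of packages leading from r-cydar down to r-biocneighbors.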

It fails because Guix tries to fetch the source code from the location encoded in the Guix package definition from 2020, and the content at that location is gone today, in 2023. Yet another observation of the well-known link-rot phenomenon.

building /gnu/store/b7x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv...
builder for `/gnu/store/b7x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv' failed to produce output path `/gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz'
build of /gnu/store/b7x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv failed
View build log at '/var/log/guix/drvs/b7/x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv.gz'.
building /gnu/store/smky0svf3gw5dz8ajj0im7kr7mzv12lr-BiocParallel_1.20.1.tar.gz.drv...
cannot build derivation `/gnu/store/xnmx8c5jgksv56g4qhsr17fsm62qclni-r-biocneighbors-1.4.2.drv': 1 dependencies couldn't be built
guix environment: error: build of `/gnu/store/xnmx8c5jgksv56g4qhsr17fsm62qclni-r-biocneighbors-1.4.2.drv' failed

What could be done? Let’s ask first: what is already done? What are the attempts behind this error message? Let’s open the build log file as reported; it reads:

Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz...
download failed "https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz" 404 "Not Found"

Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://bioconductor.org/packages/3.10/bioc/src/contrib/Archive/BiocNeighbors_1.4.2.tar.gz...
following redirection to `https://mghp.osn.xsede.org/bir190004-bucket01/archive.bioconductor.org/packages/3.10/bioc/src/contrib/Archive/BiocNeighbors_1.4.2.tar.gz'...
download failed "https://mghp.osn.xsede.org/bir190004-bucket01/archive.bioconductor.org/packages/3.10/bioc/src/contrib/Archive/BiocNeighbors_1.4.2.tar.gz" 404 "Not Found"

Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://ci.guix.gnu.org/file/BiocNeighbors_1.4.2.tar.gz/sha256/1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6...
download failed "https://ci.guix.gnu.org/file/BiocNeighbors_1.4.2.tar.gz/sha256/1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6" 404 "Not Found"

Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://tarballs.nixos.org/sha256/1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6...
download failed "https://tarballs.nixos.org/sha256/1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6" 404 "Not Found"

Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://archive.softwareheritage.org/api/1/content/sha256:261614fe06494f7f7acc42638e9a12338aacd873ec39685d421c49176f89a7af/raw/...
download failed "https://archive.softwareheritage.org/api/1/content/sha256:261614fe06494f7f7acc42638e9a12338aacd873ec39685d421c49176f89a7af/raw/" 404 "Not Found"

Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://web.archive.org/web/20231214082900/https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz...
download failed "https://web.archive.org/web/20231214082900/https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz" 404 "NOT FOUND"
Trying to use Disarchive to assemble /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz...
could not find its Disarchive specification
failed to download "/gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz" from ("https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz" "https://bioconductor.org/packages/3.10/bioc/src/contrib/Archive/BiocNeighbors_1.4.2.tar.gz")
builder for `/gnu/store/b7x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv' failed to produce output path `/gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz'
build of /gnu/store/b7x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv failed
View build log at '/var/log/guix/drvs/b7/x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv.gz'.

Before erroring out, Guix tries 6 different locations: 2 official Bioconductor locations, 1 location maintained by the Guix project, 1 location maintained by the NixOS project, 1 location from Software Heritage – the universal software archive – and 1 location from the Internet Archive’s Wayback Machine.

Sadly, if only one item is missing, everything falls down. For this specific case where Bioconductor is involved, the Guix project considers it a bug and tracks it as bug #39885.

Rewriting past origin fields

Part of this Guix bug #39885 was fixed with b032d14ebd. Sadly, this fix is from late June 2020, hence not available in the Guix revision (April 2020) we are considering.

What could be done? We provide a channel file that totally freezes one specific state. All the packages described by that state are immutable. We are doomed, aren’t we?

Maybe not yet… Guix is flexible enough to allow rewriting the complete graph of dependencies. Indeed, but one could object that it would then not be the exact same computational environment. And why would it not be? We need to talk about the identification of source code.

How does Guix fetch the source code? The answer is: fixed-output derivations. A package such as r-biocneighbors defines, among many other fields, the source field, which describes how to download (url-fetch) and from where (see uri).

  (define-public r-biocneighbors
    (package
      (name "r-biocneighbors")
      (version "1.4.2")
      (source
       (origin
         (method url-fetch)
         (uri (bioconductor-uri "BiocNeighbors" version))
         (sha256
          (base32
           "1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6"))))
      (properties `((upstream-name . "BiocNeighbors")))
      (build-system r-build-system)
      (propagated-inputs
       `(("r-biocparallel" ,r-biocparallel)
[...]

In addition, we see a checksum (sha256). This field allows Guix to verify that it downloads the expected content – the one that was packaged. We could provide any other URL location (uri) as long as the content checksum matches.

Note. To be extreme about it, the field version is a label but does not – at all – describe the source code version. Only an identifier depending on the content itself allows us to know exactly which source code version we are really using.

What does revision b032d14ebd from Guix bug #39885 fix? One of the Bioconductor URL locations. Look, the content is still available at another location,

$ guix time-machine -C channels.scm \
       -- download https://bioconductor.org/packages/3.10/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz
guile: warning: failed to install locale

Starting download of /tmp/guix-file.kQt81F
From https://bioconductor.org/packages/3.10/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz...
following redirection to `https://mghp.osn.xsede.org/bir190004-bucket01/archive.bioconductor.org/packages/3.10/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz'...
 …_1.4.2.tar.gz  882KiB                                                     348KiB/s 00:03 [##################] 100.0%
/gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6

The checksum matches. Now we have the source code in our local store. Problem solved for this missing source code of BiocNeighbors. Then, let’s re-run the command line above for creating the computational environment. Ah, it fails again because the source code of DelayedArray is missing. Then, again for another (GenomeInfoDb), and again.

Could we programmatically extend the locations where Guix downloads source code from Bioconductor? Let’s rewrite the origin field of the packages coming from the Bioconductor project. First, we identify these packages because the URL of their source code contains the string "bioconductor.org". Let’s define a predicate procedure that returns #true or #false depending on whether the uri field matches this string "bioconductor.org". For details about pattern matching with match, have a look at this post or this documentation.

;; Requires the modules (ice-9 match) and (guix packages).
(define (bioconductor? p)
  (match (package-source p)
    ((? origin? o)
     (match (origin-uri o)
       ((url rest ...)                ; the uri may be a list of mirror URLs
        (string-contains url "bioconductor.org"))
       (_ #false)))
    (_ #false)))
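
As a quick sanity check – a sketch, assuming a REPL spawned in the pinned revision and the definition of bioconductor? pasted in:

;; Started with: guix time-machine -C channels.scm -- repl
(use-modules (gnu packages) (guix packages) (ice-9 match))

(bioconductor? (specification->package "r-biocneighbors"))
;; ⇒ truthy: string-contains returns the index of the match
(bioconductor? (specification->package "r-dplyr"))
;; ⇒ #false: r-dplyr comes from CRAN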

Second, Guix provides the package-mapping procedure that, given a package, applies a user-defined procedure to the package and to all the packages it depends on, and returns the resulting package. In other words, package-mapping allows us to customize the whole graph of dependencies of dependencies. Therefore, we need to define a procedure that takes a package and creates a new package with an extended list of URLs if it is a package from Bioconductor, and does nothing otherwise.

Wait, the packages r-flowcore and r-ncdfflow, for instance, both depend on r-bh, and r-ncdfflow even depends on r-flowcore. Therefore, we want to rewrite r-flowcore or r-bh only once and not traverse all their dependencies again. We need a predicate procedure package-seen? that returns true when the origin field of a package coming from Bioconductor has already been extended. A Guix package has a properties field that can store arbitrary user-defined key/value pairs, as an association list. Each time we extend the list of URLs, we also add an element to the properties field. Then we use it for detecting whether the package has already been seen.

All in all, it reads,

(define bioconductor-url
  ;; Assumption: defined somewhere in the companion repository; the
  ;; build log above shows the host is bioconductor.org.
  "bioconductor.org")

(define (package-seen? pkg)
  (assq-ref (package-properties pkg) 'bioconductor))

(define (extend-url pkg)
  (cond
   ((package-seen? pkg) pkg)          ; Already processed
   ((bioconductor? pkg)               ; Process if it comes from Bioconductor
    (let ((src (package-source pkg)))
      (package
        (inherit pkg)
        (source
         (origin
           (inherit src)
           (uri (append
                 (origin-uri src)
                 (list (string-append
                        "https://" bioconductor-url "/packages/"
                        "3.10/bioc/src/contrib/"
                        (or
                         (assq-ref (package-properties pkg) 'upstream-name)
                         (package-name pkg))
                        "_"
                        (package-version pkg) ".tar.gz"))))))
        (properties `((bioconductor . #true)
                      ,@(package-properties pkg))))))
   (else pkg)))                        ; Do nothing for all the other packages

When pkg comes from "bioconductor.org" servers, we create a new package (package) where all the fields are copied (inherit) except the source and properties fields. Since it is a Bioconductor package, it is marked as such using the properties field. All the elements of the origin field are also copied from the original package, except the uri field, to which the new URL is appended.

Here, the checksum field is not modified at all; therefore we are building the exact same computational environment. We are only implementing an ad-hoc workaround for fetching from more locations.

Last, in order to avoid the extra work of traversing the deep dependencies of dependencies, we assume that if a dependency is not from Bioconductor, or has already been seen, it is not worth processing that dependency’s own dependencies. This strategy is what the procedure cut? implements; see here for the details, and the sketch just below.
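For the record, the glue could look like the following sketch; the actual cut? lives in the companion repository, so take this shape as an assumption rather than a verbatim copy:

(define (cut? pkg)
  ;; Assumed strategy: stop the traversal below packages that have
  ;; already been processed or that do not come from Bioconductor.
  (or (package-seen? pkg)
      (not (bioconductor? pkg))))

;; package-mapping applies extend-url to the package and to everything
;; it depends on, pruning the traversal wherever cut? returns true.
(define fix-bioconductor-url
  (package-mapping extend-url cut?))

The name fix-bioconductor-url is the one used in the final manifest below.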

Launching the command line above for creating the exact same computational environment, we now get,

building /gnu/store/0cs04zymfxpwh49z5da2ps2d4vinakhi-GenomeInfoDb_1.22.1.tar.gz.drv...
downloading from https://bioconductor.org/packages/3.10/bioc/src/contrib/GenomeInfoDb_1.22.1.tar.gz...
building /gnu/store/0k520rcg3qa4bamkgrn1x8nd1nvxbbs2-Diff-0.3.4.tar.xz.drv...
building /gnu/store/8diprmghchxf62svbapmjd2nq4g3yhhn-GenomicRanges_1.38.0.tar.gz.drv...
downloading from https://bioconductor.org/packages/3.10/bioc/src/contrib/GenomicRanges_1.38.0.tar.gz...
building /gnu/store/9i3c5vv9lmlkd0dr91252fmhzv7sa5vm-IRanges_2.20.2.tar.gz.drv...
downloading from https://bioconductor.org/packages/3.10/bioc/src/contrib/IRanges_2.20.2.tar.gz...
[...]

Awesome! Guix rocks…

Another failure

…Guix rocks but Guix cannot fix all the issues of the world. The failure now reads,

building /gnu/store/5257ay3c3qkiy51yv4frsvmjmh3hxk08-codetools_0.2-16.tar.gz.drv...
downloading from http://cran.r-project.org/src/contrib/Archive/codetools/codetools_0.2-16.tar.gz...
sha256 hash mismatch for /gnu/store/9kd2dj46zy0m8ciz2m57f0rij9m3lj5c-codetools_0.2-16.tar.gz:
  expected hash: 00bmhzqprqfn3w6ghx7sakai6s7il8gbksfiawj8in5mbhbncypn
  actual hash:   1dklibnp747a0p41ggcf8fyw36xhj9c869gay80ggfns79y7axn2
hash mismatch for store item '/gnu/store/9kd2dj46zy0m8ciz2m57f0rij9m3lj5c-codetools_0.2-16.tar.gz'

We are doomed! Game over for reproducing the exact same computational environment as the one from our paper. The CRAN project did an in-place replacement. Again, a version label is not enough when we speak about identification of source code. At package time, back in 2020, codetools labelled 0.2-16 had one checksum, and now this very same labelled 0.2-16 has another checksum. Sadly, because we do not have access to the past source code, we now have no means to know the difference and/or whether the difference matters. We have lost the ability to verify and audit how the computations had been done.

Assume that I just discover the result – say one, two or three years after the publication. How can I trust the result if I cannot audit it? If I stretch the point, what are the guarantees for trusting the result when the principles of the collective scientific method have not had the time to be applied? One, two or three years is not enough time for challenging a result, in my humble opinion.

That said, what are the options at hand? For one, we need to create the best approximation of the computational environment, for example by using this other version labelled 0.2-16. For two, the skeptics of the result should use this approximated computational environment to re-run and compare. Let’s focus on one, since I am not competent at all for two.

Again, we rely on package transformations such as package-input-rewriting, which allows replacing dependencies. First, we need to locally define a new package, named r-codetools-bis. And second, we need to rewrite the dependencies of dependencies, replacing the old r-codetools by the new r-codetools-bis. The definition of r-codetools-bis is straightforward,

(define-module (my-pkgs-fix)
  #:use-module (guix packages)
  #:use-module (guix download)
  #:use-module ((gnu packages statistics) #:select (r-codetools)))

(define-public r-codetools-bis
  (package
    (inherit r-codetools)
    (name "r-codetools-bis")
    (source
     (origin
       (method url-fetch)
       (uri
        "http://cran.r-project.org/src/contrib/Archive/codetools/codetools_0.2-16.tar.gz")
       (sha256
        (base32 "1dklibnp747a0p41ggcf8fyw36xhj9c869gay80ggfns79y7axn2"))))))

From the origin definition of r-codetools, we just copy (inherit) all the fields except the source field, which we replace with the new source code version matching the new checksum. And the manifest file is updated with,

(define with-r-codetools-bis
  (package-input-rewriting
   `((,(specification->package "r-codetools")
      .
      ,(specification->package "r-codetools-bis")))))

Nothing more.

Go for it

Let’s recap how all the pieces fit together. First, we have a list of package names (specifications) that we transform into the internal representation (packages). Second, for each package we rewrite the dependencies of dependencies to replace r-codetools with r-codetools-bis. Third, for each package we rewrite the dependencies of dependencies to extend the Bioconductor servers. Somehow, it reads,

(packages->manifest
 (map
  (compose
   fix-bioconductor-url
   with-r-codetools-bis
   specification->package)
  (list
   ;; Packages from Guix collection
   "r"
   "r-rtsne"
   "r-pheatmap"
   "r-rcolorbrewer"
   "r-ncdfflow"
   "r-edger"
   "r-flowcore"
   "r-dplyr"
   "r-combinat"
   "r-rmarkdown"                         ;render .Rmd files

   ;; Extend collection by defining in folder my-pkgs/my-pkgs.scm
   "r-cydar"
   )))

In summary, we compose the three transformations and apply this composition to each (map) listed package name. See the companion Git repository for all the details.

Last, be patient… then be more patient. Many packages need to be built from source, since most of the pre-built binary artifacts (substitutes) are gone. It takes many hours, depending on your hardware. At the end, we do not have the exact same computational environment as the one we described in our paper. Instead, we have an approximated one where the R library codetools has been replaced. For sure, Guix provides all the tools to manipulate the computational environment, controlling all the details at a very fine grain.

Next steps, work in progress

Still reading? Such a journey, isn’t it? My first opinionated and main conclusion is that Guix is very flexible and strict at the same time. The computational environment had been frozen – everything is immutable – and although the reproduction is impacted by the link-rot phenomenon, still, Guix helps. Maybe it does not appear “easy” to you, but I do not have in mind any other tool capable of such features: time travel, rewriting components on the fly, etc. If you are an expert in some other tool, let me know how you would do it. Any feedback is very welcome!

How could this reproduction be made even easier? I, among many others, am not satisfied with the current situation. We can collectively do better!

  1. If you are a paper’s author, please cite without any ambiguity the scripts and other material required by your work. Do not think that publishing the Git repository containing your scripts is enough for inspecting, verifying and auditing what you did.
    • You need to pin a specific revision when publishing.
    • You need an identifier that depends only on the content itself, and not some version label such as “version 1.2.3”.
  2. If you are a paper’s author, please think about how to cite all the software required by your work. Do not get me wrong, publishing your scripts and more is a very good practice toward a better scientific production. In addition, please also capture the information about your computational environment, as guix describe does, and publish it; see the sketch just after this list.
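
Concretely, capturing the state of your Guix is a one-liner; the exact fields of the output vary with the Guix version, but it looks like the channel file we used above:

$ guix describe --format=channels
(list (channel
        (name 'guix)
        (url "https://git.savannah.gnu.org/git/guix.git")
        (branch "master")
        (commit
          "1971d11db9ed9683d5036cd4c62deb564842e1f6")))

Redirect this output to a channels.scm file and publish it next to your scripts.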

In summary,

  • The way the source code is uniquely identified matters.
  • The way the source code is transformed into binaries also matters.

How to look up inside an archive?

As we have seen above, when the content is missing at all the expected locations, Guix tries to fetch it from the Software Heritage archive. Currently, 75% of the source code that the Guix collection provides is archived. To put it in precise terms: for 75% of the source code that the Guix collection provides, Guix is able to automatically fetch back the content from Software Heritage. In other words, the remaining 25% may already be archived, but Guix does not implement the capacity to download it from Software Heritage.
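By the way, you can check by yourself whether some content is known to Software Heritage, using the same API endpoint as in the build log above – here with the sha256 of the BiocNeighbors tarball; drop the trailing raw/ to get JSON metadata instead of the raw bytes:

$ curl -s "https://archive.softwareheritage.org/api/1/content/sha256:261614fe06494f7f7acc42638e9a12338aacd873ec39685d421c49176f89a7af/"

A 404 answer means the content is not indexed under that checksum – which is precisely what Guix ran into above.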

One example concerns the Subversion (svn) version control system. Most CTAN TeX packages are versioned using Subversion and may be archived in Software Heritage. However, in these cases the lookup mechanism is not implemented by Guix (see bug#43442). Help is very welcome!

Another example concerns “compressed tarballs”. In short, a “compressed tarball” can be split into two parts: the content itself and the metadata around it. For instance, the metadata includes the compression level, compression algorithm parameters, or details related to specific file or directory structures. Software Heritage archives only the content itself – and that’s already a lot! – but drops the surrounding metadata. Without this metadata, it is impossible to verify (via checksum) that the reassembled tarball exactly matches the version at package time.

That’s the purpose of the Disarchive database. For all the “compressed tarballs” that the Guix collection relies on, on one hand the metadata is automatically extracted and stored in a database, and on the other hand the content is archived by Software Heritage. Awesome, isn’t it?

However, as we have seen previously,

Trying to use Disarchive to assemble /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz...
could not find its Disarchive specification

this Disarchive database has holes. For instance, it does not contain the package r-biocneighbors for the Guix revision 1971d11db9. Help is very welcome, from discussion, to examples of holes, to infrastructure improvements.

Last but not least, the choice of the unique identifier. It is not a detail; it is the centerpiece. And the situation resembles the well-known xkcd strip about standards. Guix was initiated in 2012 and reuses principles pioneered by Eelco Dolstra’s PhD thesis from 2006. One of them is the normalized archive, or nar, format – comparable in spirit to a tarball. Later, Software Heritage designed the SoftWare Heritage persistent IDentifiers (SWHIDs), adapted to their archival purpose. For the interested reader, an entry point to the difference reads,

guix hash --serializer=nar --format=nix-base32 --hash=sha256
guix hash --serializer=git --format=hex        --hash=sha1

The first one matches the sha256 field in a Guix package definition. The second matches the swh:1:dir: identifier. Compare, for instance,

$ guix time-machine -C channels.scm -- edit r-catterplots

$ guix hash -S nar -f nix-base32 -H sha256 \
     $(guix time-machine -C channels.scm -- build r-catterplots --source)
0qa8liylffpxgdg8xcgjar5dsvczqhn3akd4w35113hnyg1m4xyg

$ guix hash -S git -f hex -H sha1 \
     $(guix time-machine -C channels.scm -- build r-catterplots --source)
98315f49b5f8a6bd0c537de92449d5a5ce8ff35a

And visit https://archive.softwareheritage.org/swh:1:dir:98315f49b5f8a6bd0c537de92449d5a5ce8ff35a.

Somehow, the Disarchive database provides such a mapping from nar identifiers to SWHIDs. It would be nice if Software Heritage could directly integrate such a map. Guess what? Work is in progress to bridge both and thus ease the lookup. For instance, SWH ticket#4979 introduces some mapping components. Stay tuned for more…

Rendez-vous next year! I cannot wait. Let’s see if we collectively make progress on reproducing what we just did in this post.

Join the fun, join the initiative Guix for Science!

Footnotes:

1

The paper itself incorrectly refers to Guix revision 0105f33a4d, which dates from October 2021, the time of the first reproduction attempt. We then replaced it with a Guix revision from 2020. The companion Git repository was updated accordingly, but this was not correctly reported in the paper.

