Talk at the bioinformatics seminar of the Pasteur Institute
Note: Thanks to Frédéric Le Moine from the Pasteur Institute for the invitation to talk about Guix there. The experience was great. Going to the Pasteur Institute changed my routine; I always enjoy meeting new people and learning about topics I have barely heard of before.
Although my talk (PDF) introducing Guix is now a bit polished, its delivery is not yet as smooth as I would like. First, because I still need to practise my speech delivery. Second, because explaining what Guix is is difficult; it is not straightforward to clearly distinguish the “products”. Well, let me make a quick recap of the message, and extend it with a short example that I did not have the opportunity to demo.
The talk’s main message is also expressed in the paper:
Toward practical transparent verifiable and long-term reproducible research using Guix, Nature Scientific Data, vol. 9, num. 597, Oct. 2022
A quick opinionated summary for the impatient
- Producing a scientific result implies being transparent at all stages, such that everything can be collectively verified. Both components are essential guarantees that we build knowledge on rock-solid foundations. They implicitly ask: how do we redo later and elsewhere what was done today and here?
- Any scientific activity is by definition open. Reproducible research is a means – and not an end – for strengthening trust in a scientific result.
- The redo crisis in the scientific method has many causes and no single solution. Part of the crisis comes from the lack of transparency of computational environments. From my totally partial point of view, this is where Guix helps. What I remember from the workshop Recherche reproductible : état des lieux (Reproducible research: state of affairs) is that computational environment troubles touch all fields – biology, chemistry, physics, etc., even the social sciences! And I leave aside the whole software stack required for producing raw experimental data; that is another story.
- In case it was not obvious: Guix does not solve the redo crisis, nor most of the non-reproducibility problems scientific practitioners face. Guix makes the computational environment transparent and collectively verifiable. Nothing more. That is already one step toward better scientific research: Guix helps free the scientific practitioner’s mind from the computational environment part, so that they can focus on the numerical methodology and how to compose their computations.
- When considering the computational environment, Guix is one answer to the question: how do we redo later and elsewhere what was done today and here?
What makes Guix different
- When I need the software samtools for manipulating nucleotide sequence alignments, it implicitly means I also need htslib – a C library for reading/writing nucleotide sequencing data – which itself needs bzip2 or zlib for compressing/decompressing, or htscodecs for the CRAM codecs, etc. In other words, I need a graph linking all the dependency relationships.
- Assuming Alice says “I use samtools at version 1.14”, are we using the exact same samtools depending on the versions of the dependencies? Is it the same samtools if we link it with htslib at version 1.16 or if we link it with htslib at version 1.12? And recursively for the dependencies of dependencies…
- A version label, as in “I use samtools at version 1.14”, is a handy shorthand for identifying source code, but it does not capture all the information required for:
  - Checking that all is correct. A version label is not fully transparent for a collective verification:
    - Based on a version label, how can we verify that the source code used by Alice is exactly the same as the one we fetch now? What if two source codes are identified by the same label “version 1.14”?
    - Maybe a bug was discovered in one specific version of htscodecs; how can we know that the scientific result produced by Alice is not impacted by this bug if we do not know which htscodecs version Alice used?
  - Redoing if necessary. Using later the incomplete information provided by a version label, we have no guarantee that we run inside the same computational environment as Alice. Then, if we observe a difference that leads to another conclusion:
    - Is it because of some methodological flaw in Alice’s paper?
    - Is it because some experimental parameters were poorly captured?
    - Is it because of an effect of the computational environment?

  A version label such as “I use samtools at version 1.14” is not enough when applying the scientific method. It does not allow controlling the sources of variation. Instead, the complete graph – required tools, their dependencies and the dependencies of dependencies – must be captured.
- This graph is what any package manager builds. The question is how to build it unambiguously. When relying on some dependency solver, satisfying all the constraints is not “easy”. Guix does not rely on any dependency solver but builds the graph from an explicit specification (state). In addition, for flexibility, Guix allows manipulating this graph: from one specification (state), Guix lets you declare how to replace one or more nodes in order to customize the computational environment (a short sketch at the end of this section illustrates inspecting and rewriting this graph).
- Using Guix, there is no dependency resolution, contrary to Conda, APT, Yum, etc. Instead, the user specifies the state, and this state provides packages at specific versions. Everything is captured, from the exact identification of the source code to the compilation options, recursively.
- The state of this graph is described by guix describe. It provides a pin that captures the state, i.e., the whole graph.
- This feature allows reproducing the exact same stack of software from one machine to another.
Note. Container images such as Docker or Singularity are a common solution to freeze this graph. The main drawback: the container ships binaries only. The way these binaries were produced is lost, and even when it is not, auditing how the computational environment is composed is very hard. The source-to-binary graph is not designed to be deeply verifiable. Containers such as Docker or Singularity lack the transparency required by the scientific method.
- In addition, Guix is able to exploit the Software Heritage archive. If the URL of some source code mentioned in or required by a publication vanishes, then Guix falls back to SWH in order to check out the missing source code.
- This fallback feature mitigates troubles coming from link rot as time passes. It leaves a chance to redo later.
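To make the graph points above concrete, here is a short sketch of my own (not from the talk): guix graph renders the dependency graph a package implies, and package transformation options rewrite nodes of that graph without editing any package definition. The replacement package name my-htslib is purely illustrative.

# Visualize the full dependency graph implied by samtools
# (the 'dot' command comes from Graphviz).
guix graph samtools | dot -Tsvg > samtools-graph.svg

# Rewrite one node of the graph: run samtools linked against another
# htslib variant ('my-htslib' is an illustrative package name).
guix shell samtools --with-input=htslib=my-htslib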
The 4 essentials for working with Guix
1. Capture your current state:
guix describe -f channels > state.scm
where an example of the channel file state.scm is displayed on the last slide of the PDF (a minimal sketch also follows this list).
2. Create an isolated environment:
guix shell --container -m some-tools.scm
where examples of the manifest file some-tools.scm are displayed on slides p.14, p.16, p.19, and the second-to-last slide of the PDF.
3. Collaborating or publishing means sharing two files: state.scm and some-tools.scm.
4. Re-create the exact same isolated environment, whenever and wherever:
guix time-machine -C state.scm \
  -- shell --container -m some-tools.scm

Share the exact same computational environment via a pack – Docker is one format among many others – if your collaborator does not run Guix (yet!):
guix time-machine -C state.scm \
  -- pack -f docker -m some-tools.scm
Distribute the resulting Docker image as you wish.
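For readers without the slides at hand, here is a minimal sketch of what such a state.scm channel file looks like; the commit below is illustrative, and guix describe -f channels prints the real values (recent Guix versions also include a channel introduction).

;; state.scm -- sketch of the output of 'guix describe -f channels'
;; (the commit shown here is illustrative, not an actual pin).
(list (channel
        (name 'guix)
        (url "https://git.savannah.gnu.org/git/guix.git")
        (branch "master")
        (commit "0123456789abcdef0123456789abcdef01234567")))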
Summary: the whole computational environment is captured by these two files, the manifest and the channel file. They specify one particular graph; they describe all the nodes, from source code identification to how to build or compose them (see the third-to-last slide of the PDF for an example of a package definition, i.e., the definition of one node).
How to work with Guix?
Do not miss A Guide to Reproducible Research papers for getting started.
Consider that we have already worked on a project, but without Guix. We are going to add the two files, channels and manifest. That will help for redoing.
We consider a very simple case:
- One R script for analysing the data.
- One data set.
The example here comes from the analysis of flow cytometry data, but it adapts to any other field using any other programming language.
As usual: a project = scripts + data
Assume that the source code for analysing the data is tracked in a Git repository. Let us clone it.
git clone https://github.com/MarioniLab/SignallingMassCytoStimStrength src
Note. When speaking about redoing, we need an unambiguous identifier. In the digital world, the easiest is to consider a data-dependent identifier such as a hash fingerprint. For a Git repository, we do not have to worry much: Git is designed around this concept of content-addressable identifiers.
Well, let us check out one specific revision of the source code scripts:
git -C src \
checkout d85402f3d951edf2c51281e3d09ea96a5c7da612
Since the data to analyse are missing, we need to fetch them.
Digression. For me, the right methodology is still an open question: how do we identify a data set? how do we share a data set? how do we fight against link rot? etc. Am I drifting? Let us focus on the computational environment and get back to Guix!
For the sake of the message, I heavily simplify: 1. I downsampled the data and 2. I stored the truncated data set in a location that is easy to download from.
Warning. That this data set is stored in a Git repository is not relevant here; it is only to ease sharing with you – we are demoing Guix and not data management, after all. The data set could be located on any server and downloaded via any API that such a server would provide.
Let us download some data to analyse.
git clone https://gitlab.com/zimoun/tiny-data src/data
cd src/data && gunzip *.gz && cd ..
Now, we are ready!
Computational environment: Guix
Entering the project directory src/, we see the R script files. Consider the file Timecourse_peptides_analysis.Rmd: it tells us that it requires the R library named ncdfFlow. Let us search for this package.
guix search ncdfFlow
In Guix, R libraries are packaged under a name using the prefix r- followed by the upstream name in lower case, as in r-ncdfflow. All in all, we identify the requirements:
"r" "r-rtsne" "r-pheatmap" "r-rcolorbrewer" "r-ncdfflow" "r-edger" "r-flowcore" "r-dplyr" "r-combinat" "r-rmarkdown"
However, the R library named cydar is still missing. Guix does not provide it, and it is not part of any known scientific Guix channel (browse). Well, cydar is part of the Bioconductor collection, so let us import it from there:
guix import cran -a bioconductor cydar
This command fetches metadata from the Bioconductor servers; based on this metadata, the importer returns a Guix recipe (package). The next step is explained in “How to get started writing Guix packages” – or see another French Café Guix talk, “Comment avoir plus de paquets pour Guix ?” (How to get more packages for Guix?), or this post. The end result defining r-cydar can be seen here.
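To give an idea of what the importer returns, here is a trimmed, illustrative sketch of such a recipe; the version, hash, inputs and metadata are placeholders, not the actual r-cydar definition linked above.

;; Sketch of the kind of recipe returned by 'guix import cran -a bioconductor'
;; (all fields illustrative; the real r-cydar definition is linked above).
(package
  (name "r-cydar")
  (version "X.Y.Z")                     ; filled in by the importer
  (source
   (origin
     (method url-fetch)
     (uri (bioconductor-uri "cydar" version))
     (sha256
      (base32 "0000000000000000000000000000000000000000000000000000")))) ; real hash computed by the importer
  (build-system r-build-system)
  (propagated-inputs
   (list r-biocparallel r-flowcore))    ; illustrative subset of the real inputs
  (home-page "https://bioconductor.org/packages/cydar")
  (synopsis "Differential abundance analyses for mass cytometry")
  (description "…")
  (license license:gpl3))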
Let us write the manifest file. A starting point can be the command line:
guix shell r r-ncdfflow r-rtsne r-pheatmap r-flowcore --export-manifest
which displays a manifest. In the end, the complete manifest.scm file looks like:
(specifications->manifest (list ;; Packages from Guix collection "r" "r-rtsne" "r-pheatmap" "r-rcolorbrewer" "r-ncdfflow" "r-edger" "r-flowcore" "r-dplyr" "r-combinat" "r-rmarkdown" ;render .Rmd files ;; Extend collection by defining in folder my-pkgs/my-pkgs.scm "r-cydar" ))
where I edited it to add comments – starting with a semicolon (;).
Then, it is easy to launch an isolated computational environment:
guix shell --container --load-path=my-pkgs -m manifest.scm
where --load-path points to a directory containing Guix package definitions that extend the built-in Guix package collection (a sketch of such a file follows).
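Here is a minimal sketch of how that directory can be laid out; the file and module names are my choice for this example, and the imported recipe is pasted inside a Guile module so that --load-path can find it.

;; my-pkgs/my-pkgs.scm -- sketch; file and module names are illustrative.
(define-module (my-pkgs)
  #:use-module (guix packages)
  #:use-module (guix download)
  #:use-module (guix build-system r)
  #:use-module ((guix licenses) #:prefix license:)
  #:use-module (gnu packages bioconductor))

(define-public r-cydar
  ;; Paste here the recipe returned by:
  ;;   guix import cran -a bioconductor cydar
  (package
    ;; … fields as returned by the importer …
    ))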
Note that the environment is isolated. Try some command other than R or Rscript, for instance ls or cd.
What is missing for being able to redo later and/or elsewhere? The specification of the state.
guix describe -f channels > channels.scm
Obviously, depending on when you run this command, you may get another Guix revision. The one I used when writing this post is 8e61e63, and it is left as an exercise for the reader how to run this revision.
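One possible answer, as a sketch and not spelled out in the talk: guix time-machine accepts the same --commit option as guix pull, so the pinned revision can be given directly on the command line; using the full commit hash is safest.

# Run the exact Guix revision mentioned above (full hash is safest).
guix time-machine --commit=8e61e63 \
  -- shell --container -L my-pkgs -m manifest.scm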
Now, if both files channels.scm and manifest.scm are stored with all the other project files, it becomes easy to redo:
guix time-machine -C channels.scm \
  -- shell --container -L my-pkgs -m manifest.scm \
  -- Rscript -e "rmarkdown::render('Timecourse_peptides_analysis.Rmd')"
Cool, isn’t it?
When time passes
Maybe you have noticed that the scripts are not from one of my projects but are the source code associated with a paper published in 2020. Together with Nicolas Vallet and David Michonneau, we redid part of this paper in 2021 as the demo for our paper (Nature Scientific Data, vol. 9, num. 597, Oct. 2022).
Therefore, let us take the former state file (see channels.scm here) that we described two years ago.
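Concretely, the attempt below is what triggers the failure shown next; this is a sketch of the invocation, with channels.scm now pointing at the two-year-old state linked above.

# Re-run the analysis against the two-year-old state file.
guix time-machine -C channels.scm \
  -- shell --container -L my-pkgs -m manifest.scm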
Bang!
Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz...
download failed "https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz" 404 "Not Found"
Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://bioconductor.org/packages/3.10/bioc/src/contrib/Archive/BiocNeighbors_1.4.2.tar.gz...
following redirection to `https://mghp.osn.xsede.org/bir190004-bucket01/archive.bioconductor.org/packages/3.10/bioc/src/contrib/Archive/BiocNeighbors_1.4.2.tar.gz'...
download failed "https://mghp.osn.xsede.org/bir190004-bucket01/archive.bioconductor.org/packages/3.10/bioc/src/contrib/Archive/BiocNeighbors_1.4.2.tar.gz" 404 "Not Found"
Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://ci.guix.gnu.org/file/BiocNeighbors_1.4.2.tar.gz/sha256/1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6...
download failed "https://ci.guix.gnu.org/file/BiocNeighbors_1.4.2.tar.gz/sha256/1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6" 404 "Not Found"
Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://tarballs.nixos.org/sha256/1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6...
download failed "https://tarballs.nixos.org/sha256/1bx7i5pifj8w89fnhfgcfgcar2ik2ad8wqs2rix7yks90vz185i6" 404 "Not Found"
Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://archive.softwareheritage.org/api/1/content/sha256:261614fe06494f7f7acc42638e9a12338aacd873ec39685d421c49176f89a7af/raw/...
download failed "https://archive.softwareheritage.org/api/1/content/sha256:261614fe06494f7f7acc42638e9a12338aacd873ec39685d421c49176f89a7af/raw/" 404 "Not Found"
Starting download of /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz
From https://web.archive.org/web/20231214082900/https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz...
download failed "https://web.archive.org/web/20231214082900/https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz" 404 "NOT FOUND"
Trying to use Disarchive to assemble /gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz...
could not find its Disarchive specification
failed to download "/gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz" from ("https://bioconductor.org/packages/release/bioc/src/contrib/BiocNeighbors_1.4.2.tar.gz" "https://bioconductor.org/packages/3.10/bioc/src/contrib/Archive/BiocNeighbors_1.4.2.tar.gz")
builder for `/gnu/store/b7x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv' failed to produce output path `/gnu/store/3zifa3x7yvmznic69j00q8qad4f588ah-BiocNeighbors_1.4.2.tar.gz'
build of /gnu/store/b7x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv failed
View build log at '/var/log/guix/drvs/b7/x2m61j3979mfysb9vgqih1c2qqr3nf-BiocNeighbors_1.4.2.tar.gz.drv.gz'.
Wait, let us analyse the failure.
- The package BiocNeighbors is not part of the requirements that we specified. This R library is not explicitly imported by the scripts.
- We are seeing yet another example of the link-rot issue. Back in 2021, the source code of BiocNeighbors was downloadable at the mentioned URL; now the content at this very same URL is gone.
- Guix falls back to other locations when the initially expected one fails. It tries the Software Heritage archive. That’s awesome! Currently, 75% of the source code that the Guix collection provides is archived. I will not dive into the details of Software Heritage coverage and why it fails here. Keep in mind:
  - Guix automatically exploits the Software Heritage archive to fight against link rot.
  - The way the source code is uniquely identified matters.
  - Stay tuned, the coverage is improving… (the sketch below shows how to check the archival status of a given package.)
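As an aside, not covered in the talk: Guix ships a lint checker named archival that reports whether a package’s source is already in the Software Heritage archive (and, for Git-hosted sources, requests archival when it is not). A sketch, using the very package that failed above:

# Check whether the source of r-biocneighbors is archived at Software Heritage.
guix lint -c archival r-biocneighbors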
Sadly, if only one node of the graph is missing, everything falls down. For this specific case involving Bioconductor, the Guix project considers it a bug and tracks it with #39885… For the fix, stay tuned!
Not convinced that Guix rocks? Have a look at the many use cases showing how scientific practitioners run Guix to make their research more reproducible.
Opinionated closing
Well, the computational environment and Guix are a tiny part of the big picture of reproducible research. Transparency and the ability to collectively audit the whole stack of any computation is just the scientific method. For instance, it is not affordable to redo some intensive computations that require days or weeks on very large clusters. Complete transparency and careful audit of all the stages – the complete software stack – are the guardians of trust in the scientific method, precisely in such cases where redoing the numerical experiment is not possible.
Moreover, most modern analyses imply various chained steps. These chained steps are often called a workflow, and they also amount to processing a graph – from the venerable GNU Make to Snakemake, CWL or Nextflow. The composition of these different steps (nodes) needs to make sense, so how can we check that all is correct? It matters when extracting some steps from one workflow to create another by composing them differently. For instance, tools such as bistro or funflow try to tackle this topic. Then, how do we connect the workflow with software deployment? The Guix Workflow Language is an attempt, but it lacks a strong checker. Therefore, how do we join tools such as bistro or funflow with fine control of the underlying computational environment? Reproducible research in the digital world still has a lot on its plate…
Join the fun, join the initiative Guix for Science!