This page covers reproducible best practices for using software on the FH cluster (`rhino`/`gizmo`). If you're getting started on `rhino`/`gizmo`, this page is for you.
After reading this page, you will be able to:

- find environment modules on `rhino`/`gizmo` and integrate them into your scripts
- use `conda` to install software on `rhino`/`gizmo` to your home directory

Why can't I install software on `rhino`/`gizmo` like I do on my laptop? The FH computational cluster is a shared resource: it needs to be maintained and work for a variety of users, so giving individual users root access is not advisable.
Image: a picture of (dependency) hell. Source: [The b(ack)log | A nice picture of (dependency) hell (thebacklog.net)](https://www.thebacklog.net/2011/04/04/a-nice-picture-of-dependency-hell/)
Motivation: avoid dependency hell. Often, different software executables depend on different versions of other software packages (for example, one package may require a different `gcc` version to compile than another). You may have come across this when trying to run a `python` package that requires an earlier version of `python`.
What is the overall strategy to avoid the dependency nightmare? Where possible, use a separate software environment for each step of an analysis (another way to look at it: bundle the software and specific versions of its dependencies together). When we're done with one executable in a workflow, we should unload it and then open the next software environment. Here's a good read about software environments and why you should care.
That said, you have options when you need to run software on `rhino`/`gizmo`. Let's talk about the order in which you should try them.
What is a software environment? A software environment includes the software that you're running (such as `cromwell`) and the dependencies it needs to run (such as certain versions of Java). You've already used a software environment on your own machine. It's a really big one, with lots of dependencies.
For example, on my laptop I installed a Java Software Development Kit (SDK) so I could run Cromwell; that SDK is one part of the big software environment, alongside python and R.
The problem comes when another software package needs a different version of the Java SDK. We'd have to switch a lot of things around to make it possible to run it.
This is where the idea of isolated software environments can be useful. We can bundle our two different applications with the different Java SDKs. That way, our two different software packages can run without dependency conflicts.
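As a concrete illustration (the environment names and Java versions below are hypothetical, not from this page), here is how two tools that need different Java versions could each get their own `conda` environment:

```bash
# Hypothetical example: two isolated environments, each with its own Java.
conda create --name cromwell_env openjdk=11    # tool A needs Java 11
conda create --name othertool_env openjdk=17   # tool B needs Java 17

conda activate cromwell_env    # Java 11 is on PATH here
java -version
conda deactivate

conda activate othertool_env   # Java 17 is on PATH here; no conflict
java -version
conda deactivate
```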
We'll focus on standalone executables, such as `samtools` or `bwa mem`. Click on the links to jump to that section. The order in which you should try to run software:

1. Is it available as an environment module? (Check with `module avail`, such as `module avail samtools`.) If so, use the `module load` command to load your module, and run it. Why: Scientific Computing spends time optimizing the modules to run well on `rhino`/`gizmo`, and they've solved the dependency nightmare for you.
2. Is it available as a Docker container? If so, run it in an isolated container (with Apptainer, WDL, or Nextflow).
3. Is it available on `conda`? You'll install software with `conda env` and `conda install`. Why: using `conda env` will isolate your software environments, avoiding the dependency nightmare.
4. As a last resort, compile it yourself using `make`.

The next section talks about basic best practices for running software with each of these methods.
1. Search for your software with `module avail <module-name>`, such as `module avail samtools`. Find a specific version number (if you don't know which one you need, pick the latest version). The module system is `lmod` (it's what you use when you do `module load`); its documentation is here.
2. In your script, initialize `lmod` with `source /app/lmod/lmod/init/profile`.
3. `module load` a module right when you are about to use the software in your script (use specific version numbers). For example: `module load SAMtools/0.1.20-foss-2018b`
4. After `module load`, check that you can see the executable using `which`. For example: `which samtools` should return the path where this binary is installed (`/app/software/SAMtools/0.1.20-foss-2018b/bin/samtools`).
5. Run your command. For example: `samtools view -c myfile.bam > counts.txt`
6. `module unload` your module, or `module purge` all modules, when you are done with them and starting a new step. Try not to load all modules at once at the beginning of your script. For example: `module unload SAMtools/0.1.20-foss-2018b`
Here’s an example illustrating this:
```bash
#!/bin/bash
module load SAMtools/0.1.20-foss-2018b
samtools view -c "$1" > counts.txt
module purge
```
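To run it, assuming you saved it as, say, `count_reads.sh` (a name we are inventing here), you would call `bash count_reads.sh my_bam_file.bam`.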
One of the best introductions to containers is at the Turing Way. There are three options for running containers: Option 1 can be applied directly in your bash script, whereas Option 2 requires knowledge of WDL. Option 3 is to use Nextflow to launch your analysis.
Run software in separate, isolated containers rather than one monolithic environment to avoid dependency nightmares in your bash script.
1. Load `apptainer` using `module load Apptainer/1.1.6` (or the latest version).
2. Use `apptainer pull` to pull a container (note that we recommend using Docker containers). For example: `apptainer pull docker://biocontainers/samtools:v1.9-4-deb_cv1`
3. `apptainer run` the container on your data. For example: `apptainer run docker://biocontainers/samtools:v1.9-4-deb_cv1 samtools view -c my_bam_file.bam > counts.txt` will run `samtools view` and count the number of reads in your file.

We recommend using `grabnode` to grab a `gizmo` node and test your script in interactive mode (for example: `apptainer shell docker://biocontainers/samtools:v1.9-4-deb_cv1`). This makes it a lot easier to test the software in the container. If you don't understand how the software is set up in the container, you can open an interactive shell into it to try things out.
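For instance, here is a minimal sketch of that interactive testing workflow (using the module version and container from above; `grabnode` will prompt you for the resources you need):

```bash
# On rhino: request an interactive gizmo node (answer the prompts)
grabnode

# Once on the gizmo node, open a shell inside the container:
module load Apptainer/1.1.6
apptainer shell docker://biocontainers/samtools:v1.9-4-deb_cv1

# Inside the container's shell, check that the tool is set up:
#   which samtools
#   samtools --version
```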
Here’s an example script:
```bash
#!/bin/bash
module load Apptainer/1.1.6
# script assumes you've already pulled the apptainer container with a
# command such as:
# apptainer pull docker://biocontainers/samtools:v1.9-4-deb_cv1
apptainer run docker://biocontainers/samtools:v1.9-4-deb_cv1 samtools view -c my_bam_file.bam > counts.txt
module purge
```
Use Workflow Description Language (WDL) or Nextflow to orchestrate running containers on your data. The execution engine (such as miniWDL, Cromwell, or Nextflow) will handle loading the container as different tasks are run in your WDL script.
There is a graphical user interface for running WDL scripts on your data on `gizmo`, called PROOF.
If you want to learn how to write your own WDL files, then we have the Developing WDL Workflows guide available.
Use Nextflow to run your software. This requires knowledge of Nextflow and Nextflow workflows. For more info on running Nextflow at Fred Hutch, check out this link.
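As a rough sketch (the pipeline and profile below are illustrative assumptions, not Fred Hutch specifics; see the documentation linked above for cluster-specific setup), launching a pipeline looks something like:

```bash
# Hypothetical example: run the small nf-core demo pipeline.
# The -profile flag controls how software is provisioned; recent
# nf-core pipelines accept apptainer (or singularity) profiles.
nextflow run nf-core/demo -profile apptainer --outdir results
```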
1. Load the `Anaconda` environment module to get `conda`. For example: `ml Anaconda3/2023.09-0`
2. Configure your channels:

```bash
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
```

Your `~/.condarc` (conda config) file should look like this:
```
channels:
  - conda-forge
  - bioconda
  - defaults
channel_priority: strict
```
Note that some channels (like `defaults`) require a subscription, but `bioconda` and `conda-forge` do not. You can remove the `- defaults` line to avoid this.
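One way to do that from the command line (editing `~/.condarc` directly also works):

```bash
conda config --remove channels defaults
```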
1. Use `conda create` to install your software to the home file system in its own environment. For example: `conda create --name samtools_env samtools=1.19.2-1` will install `samtools` into an environment called `samtools_env`. Try to avoid installing more than one utility into each environment, to avoid the dependency nightmare.
2. In your script, load the `Anaconda` environment module: `ml Anaconda3/2023.09-0`
3. Activate your environment with `conda activate`. For example: `conda activate samtools_env`
4. Check that you can see the executable using `which`: `which samtools`. This should return a path to where your `samtools` binary is installed.
5. When done, `conda deactivate` the `conda` environment. For example: `conda deactivate`.

Here's an example script, assuming you have installed `samtools` into an environment called `samtools_env`:
```bash
#!/bin/bash
ml Anaconda3/2023.09-0
conda activate samtools_env
samtools view -c "$1" > counts.txt
conda deactivate
module purge
```
If none of the above options work, you can compile the software yourself. Load a compiler toolchain, for example: `ml foss/2023b`. Install the compiled executable somewhere in your home directory and add its location to your `$PATH` in your `.bashrc` file.
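A hedged sketch of what that might look like (the tool name, tarball, and install prefix are hypothetical; the exact build steps depend on the project):

```bash
# Load a compiler toolchain
ml foss/2023b

# Unpack and build a hypothetical tool into ~/.local
tar -xzf mytool-1.0.tar.gz
cd mytool-1.0
./configure --prefix="$HOME/.local"   # if the project uses autotools
make
make install

# Then add this line to your ~/.bashrc so the binary is on your PATH:
# export PATH="$HOME/.local/bin:$PATH"
```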
Your future self (and future lab mates) will thank you for taking the time to disentangle the code and solve the dependency nightmare once and for all. If you need help, be sure to schedule a Data House Call with the Data Science Lab, and join the Fred Hutch Data Slack and its #workflow-managers channel.
For community support (you’re not doing this alone!), consider joining the Research Informatics Community Studios.
How can we make a script that uses environment modules more reproducible? We can isolate the modules by loading them only when we're using them in the script.
Here's an example where we are running `multiqc` and `bwa mem` on a FASTA file.
```bash
#!/bin/bash
# usage: bash combo_script1 <path to FASTA file>

# Load relevant modules
module load MultiQC/1.21-foss-2023a-Python-3.11.3
module load BWA/0.7.17-GCCcore-12.2.0

# run multiqc
multiqc "$1"

# run bwa mem
bwa mem -o "$1".mem.sam ../reference/hg38.fasta "$1"
```
The more isolated software environment approach is to load and unload modules as we use them:
```bash
#!/bin/bash
# load multiqc first - first task
module load MultiQC/1.21-foss-2023a-Python-3.11.3
multiqc "$1"
# remove modules for next step
module purge

# load bwa - second task
module load BWA/0.7.17-GCCcore-12.2.0
bwa mem -o "$1".mem.sam ../reference/hg38.fasta "$1"
module purge
```
This also has the advantage of being the first step in transforming your script to WDL: each of these `module load` / `module purge` sections is basically a task in WDL.
This assumes that nothing similar exists in WILDS WDL Workflows. If something does, then fork that workflow and start from there.
- In your `runtime` block, use `docker: "mycontainer:version_number"` to specify your Docker container. Note: `latest` is not a version number.
- Alternatively, use `module load` and `module purge` in the command block for each task, and make sure to specify a version number.