On this page, we cover reproducible best practices for using software on the FH cluster (rhino/gizmo). If you’re getting started on rhino/gizmo, this page is for you.
After reading this page, you will be able to:

- find and load environment modules on rhino/gizmo and integrate them into your scripts
- use conda to install software on rhino/gizmo to your home directory

**Why can’t I use rhino/gizmo like my laptop?** The FH computational cluster is a shared resource. It needs to be maintained and work for a variety of users, so giving individual users root access is not advisable.

*Source: [The b(ack)log | A nice picture of (dependency) hell (thebacklog.net)](https://www.thebacklog.net/2011/04/04/a-nice-picture-of-dependency-hell/)*
Motivation: avoid dependency hell. Often, different software executables depend on different versions of software packages (for example, one package may require a different gcc version to compile than another). You may have come across this when trying to run a python package that requires an earlier version of python.
What is the overall strategy to avoid the dependency nightmare? Where possible, use a separate software environment for each step of an analysis (another way to look at it: bundle the software and its versioned dependencies together). When we’re done with one executable in a workflow, we should unload it and then open the next software environment. Here’s a good read about software environments and why you should care.
That said, you have options when you need to run software on rhino/gizmo. Let’s talk about the order in which you should try them when finding and running software.
What is a software environment? A software environment includes the software that you’re running (such as cromwell) and the dependencies needed to run it (such as certain versions of Java). You’ve already used a software environment on your own machine: it’s a really big one, with lots of dependencies.
For example, on my laptop I installed a Java Software Development Kit (SDK) so I could run Cromwell; that SDK is one part of the big software environment, alongside python and R.
The problem comes when another software package needs a different version of the Java SDK. We’d have to switch a lot of things around to make it possible to run it.
This is where the idea of isolated software environments becomes useful. We can bundle each of our two applications with its own Java SDK. That way, the two software packages can run without dependency conflicts.
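As a rough sketch of this idea using environment modules (the Java module names below are hypothetical; check `module avail Java` for what actually exists on rhino/gizmo):

```bash
#!/bin/bash
# Two tools need different Java versions; isolate them by loading and
# purging modules around each step (module names are hypothetical).
module load Java/11
java -jar cromwell.jar run my_workflow.wdl   # tool that needs Java 11
module purge
module load Java/17
java -jar other_tool.jar                     # hypothetical tool needing Java 17
module purge
```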

We’ll focus on standalone executables, such as samtools or bwa mem. Click on the links to jump to each section. Here is the order in which you should try to run software:
1. Is there an environment module available? (Check with `module avail`, such as `module avail samtools`.) If so, use the `module load` command to load your module, and run it. Why: Scientific Computing spends time optimizing the modules to run well on rhino/gizmo, and they’ve solved the dependency nightmare for you.
2. Is there a container available? If so, run it with Apptainer, or orchestrate it with WDL or Nextflow. Why: containers bundle software together with its dependencies, isolated from everything else.
3. Can you install it with conda? You’ll install software with `conda env` and `conda install`. Why: using `conda env` will isolate your software environments, avoiding the dependency nightmare.
4. As a last resort, compile it yourself from source with `make`.

The next section talks about basic best practices when running software using each of these methods.
1. Search for your module with `module avail <module-name>`, such as `module avail samtools`. Find a specific version number (if you don’t know, pick the latest version).
2. If your script needs it, initialize lmod (the system behind `module load`) with `source /app/lmod/lmod/init/profile`.
3. `module load` right when you are about to use the software in your script (use specific version numbers). For example: `module load SAMtools/0.1.20-foss-2018b`.
4. After `module load`, check that you can see the executable using `which`. For example: `which samtools` should return the path where this binary is installed (`/app/software/SAMtools/0.1.20-foss-2018b/bin/samtools`).
5. Run your command, for example: `samtools view -c myfile.bam > counts.txt`.
6. `module unload` your module, or `module purge` all modules, when you are done using them and starting a new step. Try not to load all modules at once at the beginning of the script. For example: `module unload SAMtools/0.1.20-foss-2018b`.

Here’s an example illustrating this:
#!/bin/bash
module load SAMtools/0.1.20-foss-2018b
samtools view -c $1 > counts.txt
module purge
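Assuming you saved this script as `count_reads.sh` (a name chosen here for illustration), you would run it on a BAM file like so:

```bash
bash count_reads.sh my_bam_file.bam
cat counts.txt   # the read count written by samtools view -c
```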
One of the best introductions to containers is at the Turing Way. There are three options for running containers: Option 1 can be applied directly in your script, whereas Option 2 requires knowledge of WDL. Option 3 uses Nextflow to launch your analysis.
Run software in separate, isolated containers rather than in one monolithic environment to avoid dependency nightmares in your bash script.
1. Load apptainer with `module load Apptainer/1.1.6` (or the latest version).
2. Use `apptainer pull` to pull a container (note that we recommend using Docker containers). For example: `apptainer pull docker://biocontainers/samtools:v1.9-4-deb_cv1`.
3. Use `apptainer run` to run the container on your data. For example: `apptainer run docker://biocontainers/samtools:v1.9-4-deb_cv1 samtools view -c my_bam_file.bam > counts.txt` will run `samtools view` and count the number of reads in your file.
4. We recommend using `grabnode` to grab a gizmo node and test your script in interactive mode; this makes it a lot easier to test the software in the container. If you don’t understand how the software is set up in the container, you can open an interactive shell into it to test things out (for example: `apptainer shell docker://biocontainers/samtools:v1.9-4-deb_cv1`). See the interactive sketch after the example script below.
Here’s an example script:
#!/bin/bash
module load Apptainer/1.1.6
# script assumes you've already pulled the apptainer container with a
# command such as:
# apptainer pull docker://biocontainers/samtools:v1.9-4-deb_cv1
apptainer run docker://biocontainers/samtools:v1.9-4-deb_cv1 samtools view -c my_bam_file.bam > counts.txt
module purge
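Here’s a sketch of what that interactive testing session might look like (prompts and exact output will vary):

```bash
# request an interactive gizmo node (follow the prompts)
grabnode

# once on the node, load apptainer and open a shell inside the container
module load Apptainer/1.1.6
apptainer shell docker://biocontainers/samtools:v1.9-4-deb_cv1

# inside the container, poke around to see how the software is set up
which samtools
samtools --version
exit
```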
Use Workflow Description Language (WDL) or Nextflow to orchestrate running containers on your data. The execution engine (such as miniWDL, Cromwell, or Nextflow) will handle loading the container as different tasks are run in your WDL script.
There is a graphical user interface for running WDL scripts on your data on gizmo, called PROOF.
If you want to learn how to write your own WDL files, then we have the Developing WDL Workflows guide available.
Use Nextflow to run your software; this requires knowledge of Nextflow and its workflow language. For more info on running Nextflow at Fred Hutch, check out this link.
1. Use the Miniforge3 environment module to load conda. For example: `ml Miniforge3`.
2. Use `conda create` to install your software to the home file system in its own environment. For example: `conda create --name samtools_env samtools=1.19.2-1` will install samtools into an environment called `samtools_env`. Try not to install more than one utility into each environment, to avoid the dependency nightmare.
3. In your script, load the Miniforge3 environment module: `ml Miniforge3`.
4. Activate your environment with `conda activate`. For example: `conda activate samtools_env`.
5. Check that you can see the executable using `which`: `which samtools`. This should return a path to where your samtools binary is installed.
6. When you’re done, `conda deactivate` the conda environment. For example: `conda deactivate`.

Here’s an example script, assuming you have installed samtools into an environment called `samtools_env`:
#!/bin/bash
ml Miniforge3
conda activate samtools_env
samtools view -c $1 > counts.txt
conda deactivate
module purge
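For reference, here is the one-time setup the script above assumes; run this interactively (not in the script itself):

```bash
# one-time setup: create the environment the script expects
ml Miniforge3
conda create --name samtools_env samtools=1.19.2-1
conda activate samtools_env
which samtools   # confirm the binary resolves inside the environment
conda deactivate
```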
If you must compile software from source with `make`, load a compiler toolchain module first (for example: `ml foss/2023b`) and add the location of the installed binaries to your `$PATH` in your `.bashrc` file.

Your future self (and future lab mates) will thank you for taking the time to disentangle the code and solve the dependency nightmare once and for all. If you need help, be sure to schedule a Data House Call with the Data Science Lab, and join the Fred Hutch Data Slack, where you can join the #workflow-managers channel.
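To illustrate the build-from-source option mentioned above, here’s a rough sketch (the tool name `mytool` and its paths are hypothetical):

```bash
# rough sketch of building from source (tool and paths hypothetical)
ml foss/2023b                     # load a compiler toolchain
cd ~/src/mytool                   # your unpacked source code
make                              # compile the software
mkdir -p ~/bin
cp mytool ~/bin/                  # install into your home directory

# then add this line to your .bashrc so the shell can find it:
# export PATH="$HOME/bin:$PATH"
```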
For community support (you’re not doing this alone!), consider joining the Research Informatics Community Studios.
How can we make a script that uses environment modules to be more reproducible? We can isolate the modules by loading them only when we’re using them in a script.
Here’s an example where we are running multiqc and bwa mem on a fasta file.
#!/bin/bash
# usage: bash combo_script1 $1
# where $1 is the path to a FASTA file
# Load relevant modules
module load MultiQC/1.21-foss-2023a-Python-3.11.3
module load BWA/0.7.17-GCCcore-12.2.0
#run multiqc
multiqc $1
# run bwa-mem
bwa mem -o $1.mem.sam ../reference/hg38.fasta $1
In the more isolated software environment approach, we load and unload modules as we use them:
#!/bin/bash
#load multiqc first - first task
module load MultiQC/1.21-foss-2023a-Python-3.11.3
multiqc $1
#remove modules for next step
module purge
#load bwa - second task
module load BWA/0.7.17-GCCcore-12.2.0
bwa mem -o $1.mem.sam ../reference/hg38.fasta $1
module purge
This also has the advantage of being the first step in transforming your script to WDL: each of these module load/module purge sections is essentially a task in WDL.
This assumes that nothing similar exists in WILDS WDL Workflows. If something does, then fork that workflow and start from there.
When converting each task to WDL:

- In the `runtime` block, use `docker: mycontainer:version_number` to specify your Docker container. Note: `latest` is not a version number.
- Alternatively, use `module load` and `module purge` in the `command` block for each task, and make sure to specify a version number.
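As an illustration, here’s a minimal sketch of what the samtools step above might look like as a WDL task (the task and output names are our own choices; the container is the one used earlier):

```wdl
version 1.0

task count_reads {
  input {
    File bam
  }
  command <<<
    samtools view -c ~{bam} > counts.txt
  >>>
  output {
    File counts = "counts.txt"
  }
  runtime {
    docker: "biocontainers/samtools:v1.9-4-deb_cv1"
  }
}
```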