The Workflow Description Language (WDL) is an open-source language for describing data processing workflows with human-readable syntax. WDL makes it straightforward to define analysis tasks, chain them together in workflows, and parallelize their execution across different computing environments.

Why Use WDL?

Easy to Read and Share

WDL’s design emphasizes clarity and simplicity:

  • Human-readable syntax makes workflows easy to understand, review, and customize as necessary
  • Standardized format enables sharing across institutions and research groups
  • Reusable components through task libraries like the WILDS WDL Library

Easy to Execute

WDL separates the scientific logic (what to compute) from infrastructure details (where and how to compute):

  • No need to write SLURM submission scripts or cloud deployment code
  • Execution engines handle job scheduling, resource allocation, and data staging
  • Focus on your science, not on system administration

Reproducibility

WDL workflows explicitly define every aspect of your analysis:

  • Exact software versions through containerization (Docker/Apptainer)
  • Deterministic execution, so the same inputs produce the same results across different platforms
  • All required inputs and files clearly specified upfront

This means a workflow written today will produce the same results months or years later, regardless of changes to your computing environment.

Portability

Write once, run anywhere. WDL workflows can execute on:

  • Local workstations
  • Institutional HPC clusters (like Fred Hutch’s Gizmo)
  • Cloud platforms (AWS, Google Cloud, Azure)
  • Workflow platforms (Terra, DNAnexus, PROOF)

The same workflow file runs identically across all these environments without modification.

WDL Fundamentals

Let’s start with a high-level overview of WDL syntax (for more comprehensive instruction, see our online WDL course).

Structure Overview

A WDL workflow consists of three main components:

  1. Workflow: Defines the overall analysis pipeline and how tasks connect
  2. Tasks: Individual units of work (like running a specific tool)
  3. Inputs: Parameters and files needed to run the workflow

Anatomy of a Task

Here’s an example task that aligns sequencing reads using BWA:

task BwaMem {
  input {
    File input_fastq
    String base_file_name
    File ref_fasta
  }

  command <<<
    # Index reference if needed
    bwa index ~{ref_fasta}

    # Align reads
    bwa mem -p -v 3 -t 16 -M \
      ~{ref_fasta} ~{input_fastq} > ~{base_file_name}.sam

    # Convert to BAM
    samtools view -1bS -@ 15 -o ~{base_file_name}.aligned.bam ~{base_file_name}.sam
  >>>

  output {
    File output_bam = "~{base_file_name}.aligned.bam"
  }

  runtime {
    cpu: 16
    memory: "32 GB"
    docker: "getwilds/bwa:0.7.17"
  }
}

Key sections:

  • input: Files and parameters the task needs
  • command: Shell commands to execute (note the ~{variable} syntax for variable interpolation)
  • output: Files generated by the task that can be used by downstream tasks
  • runtime: Computing resources and software environment (container) required
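
Inputs can also declare default values, and runtime attributes can reference those inputs, which keeps resource requests configurable. A minimal sketch (the SortFile task and its ubuntu:22.04 container are illustrative, not part of the example above):

task SortFile {
  input {
    File infile
    Int threads = 4            # default value; callers may override it
  }

  command <<<
    sort --parallel=~{threads} ~{infile} > sorted.txt
  >>>

  output {
    File sorted = "sorted.txt"
  }

  runtime {
    cpu: threads               # runtime attributes can reference task inputs
    memory: "4 GB"
    docker: "ubuntu:22.04"
  }
}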

Anatomy of a Workflow

Workflows orchestrate multiple tasks:

version 1.0

workflow AlignAndCallVariants {
  input {
    File input_fastq
    File reference_genome
    String sample_name
  }

  # Align reads
  call BwaMem {
    input:
      input_fastq = input_fastq,
      ref_fasta = reference_genome,
      base_file_name = sample_name
  }

  # Call variants
  call HaplotypeCaller {
    input:
      input_bam = BwaMem.output_bam,
      ref_fasta = reference_genome,
      sample_name = sample_name
  }

  output {
    File aligned_bam = BwaMem.output_bam
    File variants_vcf = HaplotypeCaller.output_vcf
  }
}

Key features:

  • Tasks are executed using call statements
  • Task outputs are referenced as TaskName.output_name and passed as inputs to subsequent tasks
  • Workflow inputs can be passed to multiple tasks
  • Workflow outputs define which files to keep as final results

Providing Inputs

Inputs are typically provided via JSON files:

{
  "AlignAndCallVariants.input_fastq": "/path/to/sample1.fastq.gz",
  "AlignAndCallVariants.reference_genome": "/path/to/hg38.fa",
  "AlignAndCallVariants.sample_name": "patient_001"
}

This separation allows the same workflow to run on different datasets without modifying the WDL file.
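
As a rough sketch of how an inputs file is used at run time (the file names here are examples; see WDL Execution Engines below for engine-specific details):

# Generate a template inputs JSON from the workflow (womtool ships with Cromwell)
java -jar womtool.jar inputs align_and_call.wdl > inputs.json

# Run the workflow with those inputs using miniWDL
miniwdl run align_and_call.wdl -i inputs.json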

Containerization Strategies

Why Containers?

Containers ensure reproducibility by packaging software with all its dependencies:

  • Eliminates “works on my machine” problems
  • Guarantees consistent software versions
  • Simplifies complex dependency management
  • Makes workflows immediately runnable on new systems

Docker

Docker is the most common containerization platform for WDL workflows:

runtime {
  docker: "getwilds/star:2.7.6a"
}

The WILDS Docker Library provides pre-built, tested containers for common bioinformatics tools. For more on using Docker at Fred Hutch, see our Docker guide.

Apptainer (formerly Singularity)

On HPC clusters that don’t allow Docker (like Fred Hutch’s Gizmo), WDL execution engines can be configured to use Apptainer instead:

  • You still specify docker: in your WDL runtime section
  • The execution engine (when properly configured) converts Docker images to Apptainer format
  • Your workflow code remains portable: the same WDL file works with both Docker and Apptainer
  • Note: This requires the execution engine to be set up with Apptainer support (platforms like PROOF handle this configuration for you)

For more details on using Apptainer at Fred Hutch, see our Apptainer guide.

Environment Modules (HPC-specific)

Some HPC systems use module systems (like EasyBuild/Lmod) for software management. At Fred Hutch, you can leverage our extensive module collection:

runtime {
  modules: "SAMtools/1.11-GCC-10.2.0"
}

However, this approach reduces portability since modules are institution-specific. For workflows you plan to share, containers are preferred. Learn more about environment modules and containers in our Computing Environments guide.

Advanced Features

Parallelization with Scatter-Gather

Process multiple samples or genomic regions in parallel:

workflow ProcessMultipleSamples {
  input {
    Array[File] sample_fastqs
    File reference_genome
  }

  scatter (fastq in sample_fastqs) {
    call AlignReads {
      input:
        input_fastq = fastq,
        ref_fasta = reference_genome
    }
  }

  output {
    Array[File] all_bams = AlignReads.output_bam
  }
}

The execution engine automatically parallelizes scattered tasks across available compute resources.
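
The “gather” half of the pattern is implicit: outside the scatter block, AlignReads.output_bam becomes an Array[File] that downstream tasks can consume directly. For example (MergeBams is a hypothetical task):

  # Gather step: merge all scattered BAMs into one file (MergeBams is hypothetical)
  call MergeBams {
    input:
      bams = AlignReads.output_bam   # Array[File] collected from the scatter block
  }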

Conditional Execution

Run tasks only when certain conditions are met:

if (run_quality_control) {
  call FastQC {
    input: reads = input_fastq
  }
}
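
Because the call only runs when the condition is true, its outputs are optional types (e.g., File?) outside the if block; keep them optional in your workflow outputs or resolve them with select_first(). A sketch, assuming FastQC exposes an output named report:

output {
  # Optional because FastQC may not have run
  File? qc_report = FastQC.report
}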

Task-level Retry Logic

Automatically retry tasks that fail due to transient issues:

runtime {
  maxRetries: 3
  docker: "getwilds/bwa:0.7.17"
}

Workflow Development Best Practices

Start Small, Scale Up

  1. Develop with small test datasets on a single sample
  2. Validate outputs are correct
  3. Scale to full datasets using scatter-gather
  4. Let the execution engine handle parallelization

Version Everything

  • Use specific container tags (not latest)
  • Specify WDL version at the top of your file: version 1.0
  • Track your workflow in version control (Git)

Modularize Your Code

  • Break complex analyses into reusable tasks
  • Import tasks from libraries like WILDS WDL Library
  • One workflow per repository for easier sharing

Document Your Work

  • Add comments explaining complex logic
  • Include example input JSON files
  • Provide a README with workflow purpose and requirements

Getting Started with WDL

Learning Resources

Official Documentation:

Fred Hutch Resources:

Community Resources:

  • WARP - Broad Institute’s WDL workflows
  • GATK Workflows - Production genomics pipelines
  • bioWDL - Bioinformatics workflow templates

Development Tools

VS Code Extension:

Validation Tools:

  • sprocket lint - Lint WDL files for best practices
  • miniwdl check - Lint and validate WDL files
  • womtool validate - Check WDL syntax without running
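
Typical usage looks like the following (the workflow file name is an example):

# Lint and validate a workflow before running it
sprocket lint align_and_call.wdl
miniwdl check align_and_call.wdl
java -jar womtool.jar validate align_and_call.wdl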

Running WDL Workflows

WDL workflows require an execution engine to run. See our guide on WDL Execution Engines for details on:

  • Cromwell - Best for shared HPC systems with advanced features
  • miniWDL - Lightweight, easy local execution
  • Sprocket - Modern alternative with easy setup

At Fred Hutch, you can also use PROOF, a user-friendly interface for submitting WDL workflows to our cluster, or Cirro for cloud-based execution.

Utilizing the WILDS WDL Library

The WILDS WDL Library provides tested, reusable components:

version 1.0

import "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-sra/ww-sra.wdl" as sra_tasks
import "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-star/ww-star.wdl" as star_tasks

workflow RNAseqFromSRA {
  input {
    String sra_id
    File reference_genome
    File gene_annotations
  }

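  # Download raw FASTQ data from SRA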
  call sra_tasks.fastqdump {
    input: sra_id = sra_id
  }

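  # Align reads with STAR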
  call star_tasks.star_align {
    input:
      fastq = fastqdump.fastq,
      genome = reference_genome,
      annotations = gene_annotations
  }

  output {
    File aligned_bam = star_align.output_bam
    File gene_counts = star_align.counts
  }
}

This approach lets you build complex workflows quickly without writing every task from scratch. If there's a tool or workflow you think would be useful, feel free to file an issue in the WILDS WDL Library GitHub repo or reach out to us directly!

Getting Help

Fred Hutch Resources:

Broader Community:

Next Steps

  1. Learn the basics with Data Science Lab’s WDL course
  2. Explore examples in the WILDS WDL Library
  3. Choose an execution engine - see WDL Execution Engines
  4. Run your first workflow using PROOF or a local executor
  5. Join the community in #workflow-managers Slack
