The Workflow Description Language (WDL) is an open-source language for describing data processing workflows with human-readable syntax. WDL makes it straightforward to define analysis tasks, chain them together in workflows, and parallelize their execution across different computing environments.

Why Use WDL?

Easy to Read and Share

WDL’s design emphasizes clarity and simplicity:

  • Human-readable syntax makes workflows easy to understand, review, and customize as necessary
  • Standardized format enables sharing across institutions and research groups
  • Reusable components through task libraries like the WILDS WDL Library

Easy to Execute

WDL separates the scientific logic (what to compute) from infrastructure details (where and how to compute):

  • No need to write SLURM submission scripts or cloud deployment code
  • Execution engines handle job scheduling, resource allocation, and data staging
  • Focus on your science, not on system administration

Reproducibility

WDL workflows explicitly define every aspect of your analysis:

  • Exact software versions through containerization (Docker/Apptainer)
  • Deterministic execution, so the same inputs produce the same results across different platforms
  • All required inputs and files clearly specified upfront

This means a workflow written today will produce the same results months or years later, regardless of changes to your computing environment.

Portability

Write once, run anywhere. WDL workflows can execute on:

  • Local workstations
  • Institutional HPC clusters (like Fred Hutch’s Gizmo)
  • Cloud platforms (AWS, Google Cloud, Azure)
  • Workflow platforms (Terra, DNAnexus, PROOF)

The same workflow file runs identically across all these environments without modification.

WDL Fundamentals

Let’s start with a high-level overview of WDL syntax (for more comprehensive instruction, see our online WDL course).

Structure Overview

A WDL workflow consists of three main components:

  1. Workflow: Defines the overall analysis pipeline and how tasks connect
  2. Tasks: Individual units of work (like running a specific tool)
  3. Inputs: Parameters and files needed to run the workflow

Anatomy of a Task

Here’s an example task that aligns sequencing reads using BWA:

task BwaMem {
  input {
    File input_fastq
    String base_file_name
    File ref_fasta
  }

  command <<<
    # Index reference if needed
    bwa index ~{ref_fasta}

    # Align reads
    bwa mem -p -v 3 -t 16 -M \
      ~{ref_fasta} ~{input_fastq} > ~{base_file_name}.sam

    # Convert to BAM
    samtools view -1bS -@ 15 -o ~{base_file_name}.aligned.bam ~{base_file_name}.sam
  >>>

  output {
    File output_bam = "~{base_file_name}.aligned.bam"
  }

  runtime {
    cpu: 16
    memory: "32 GB"
    docker: "getwilds/bwa:0.7.17"
  }
}

Key sections:

  • input: Files and parameters the task needs
  • command: Shell commands to execute (note the ~{variable} syntax for variable interpolation)
  • output: Files generated by the task that can be used by downstream tasks
  • runtime: Computing resources and software environment (container) required
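
Inputs can also declare default values, and runtime attributes can reference those inputs, which keeps resource requests configurable. A minimal sketch (the SortFile task and its ubuntu:22.04 container are illustrative, not part of the example above):

task SortFile {
  input {
    File infile
    Int threads = 4            # default value; callers may override it
  }

  command <<<
    sort --parallel=~{threads} ~{infile} > sorted.txt
  >>>

  output {
    File sorted = "sorted.txt"
  }

  runtime {
    cpu: threads               # runtime attributes can reference task inputs
    memory: "4 GB"
    docker: "ubuntu:22.04"
  }
}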

Anatomy of a Workflow

Workflows orchestrate multiple tasks:

version 1.0

workflow AlignAndCallVariants {
  input {
    File input_fastq
    File reference_genome
    String sample_name
  }

  # Align reads
  call BwaMem {
    input:
      input_fastq = input_fastq,
      ref_fasta = reference_genome,
      base_file_name = sample_name
  }

  # Call variants
  call HaplotypeCaller {
    input:
      input_bam = BwaMem.output_bam,
      ref_fasta = reference_genome,
      sample_name = sample_name
  }

  output {
    File aligned_bam = BwaMem.output_bam
    File variants_vcf = HaplotypeCaller.output_vcf
  }
}

Key features:

  • Tasks are executed using call statements
  • Task outputs are referenced as TaskName.output_name and passed as inputs to subsequent tasks
  • Workflow inputs can be passed to multiple tasks
  • Workflow outputs define which files to keep as final results

Providing Inputs

Inputs are typically provided via JSON files:

{
  "AlignAndCallVariants.input_fastq": "/path/to/sample1.fastq.gz",
  "AlignAndCallVariants.reference_genome": "/path/to/hg38.fa",
  "AlignAndCallVariants.sample_name": "patient_001"
}

This separation allows the same workflow to run on different datasets without modifying the WDL file.
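
As a rough sketch of how an inputs file is used at run time (the file names here are examples; see WDL Execution Engines below for engine-specific details):

# Generate a template inputs JSON from the workflow (womtool ships with Cromwell)
java -jar womtool.jar inputs align_and_call.wdl > inputs.json

# Run the workflow with those inputs using miniWDL
miniwdl run align_and_call.wdl -i inputs.json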

Containerization Strategies

Why Containers?

Containers ensure reproducibility by packaging software with all its dependencies:

  • Eliminates “works on my machine” problems
  • Guarantees consistent software versions
  • Simplifies complex dependency management
  • Makes workflows immediately runnable on new systems

Docker

Docker is the most common containerization platform for WDL workflows:

runtime {
  docker: "getwilds/star:2.7.6a"
}

The WILDS Docker Library provides pre-built, tested containers for common bioinformatics tools. For more on using Docker at Fred Hutch, see our Docker guide.

Apptainer (formerly Singularity)

On HPC clusters that don’t allow Docker (like Fred Hutch’s Gizmo), WDL execution engines can be configured to use Apptainer instead:

  • You still specify docker: in your WDL runtime section
  • The execution engine (when properly configured) converts Docker images to Apptainer format
  • Your workflow code remains portable: the same WDL file works with both Docker and Apptainer
  • Note: This requires the execution engine to be set up with Apptainer support (platforms like PROOF handle this configuration for you)

For more details on using Apptainer at Fred Hutch, see our Apptainer guide.

Environment Modules (HPC-specific)

Some HPC systems use module systems (like EasyBuild/Lmod) for software management. At Fred Hutch, you can leverage our extensive module collection:

runtime {
  modules: "SAMtools/1.11-GCC-10.2.0"
}

However, this approach reduces portability since modules are institution-specific. For workflows you plan to share, containers are preferred. Learn more about environment modules and containers in our Computing Environments guide.

Advanced Features

Parallelization with Scatter-Gather

Process multiple samples or genomic regions in parallel:

workflow ProcessMultipleSamples {
  input {
    Array[File] sample_fastqs
    File reference_genome
  }

  scatter (fastq in sample_fastqs) {
    call AlignReads {
      input:
        input_fastq = fastq,
        ref_fasta = reference_genome
    }
  }

  output {
    Array[File] all_bams = AlignReads.output_bam
  }
}

The execution engine automatically parallelizes scattered tasks across available compute resources.
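
The “gather” half of the pattern is implicit: outside the scatter block, AlignReads.output_bam becomes an Array[File] that downstream tasks can consume directly. For example (MergeBams is a hypothetical task):

  # Gather step: merge all scattered BAMs into one file (MergeBams is hypothetical)
  call MergeBams {
    input:
      bams = AlignReads.output_bam   # Array[File] collected from the scatter block
  }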

Conditional Execution

Run tasks only when certain conditions are met:

if (run_quality_control) {
  call FastQC {
    input: reads = input_fastq
  }
}
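
Because the call only runs when the condition is true, its outputs are optional types (e.g., File?) outside the if block; keep them optional in your workflow outputs or resolve them with select_first(). A sketch, assuming FastQC exposes an output named report:

output {
  # Optional because FastQC may not have run
  File? qc_report = FastQC.report
}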

Task-level Retry Logic

Automatically retry tasks that fail due to transient issues:

runtime {
  maxRetries: 3
  docker: "getwilds/bwa:0.7.17"
}

Workflow Development Best Practices

Start Small, Scale Up

  1. Develop with small test datasets on a single sample
  2. Validate outputs are correct
  3. Scale to full datasets using scatter-gather
  4. Let the execution engine handle parallelization

Version Everything

  • Use specific container tags (not latest)
  • Specify WDL version at the top of your file: version 1.0
  • Track your workflow in version control (Git)

Modularize Your Code

  • Break complex analyses into reusable tasks
  • Import tasks from libraries like WILDS WDL Library
  • One workflow per repository for easier sharing

Document Your Work

  • Add comments explaining complex logic
  • Include example input JSON files
  • Provide a README with workflow purpose and requirements

Getting Started with WDL

Learning Resources

Official Documentation:

Fred Hutch Resources:

Community Resources:

  • WARP - Broad Institute’s WDL workflows
  • GATK Workflows - Production genomics pipelines
  • bioWDL - Bioinformatics workflow templates

Development Tools

VS Code Extension:

Validation Tools:

  • sprocket lint - Lint WDL files for best practices
  • miniwdl check - Lint and validate WDL files
  • womtool validate - Check WDL syntax without running
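
Typical usage looks like the following (the workflow file name is an example):

# Lint and validate a workflow before running it
sprocket lint align_and_call.wdl
miniwdl check align_and_call.wdl
java -jar womtool.jar validate align_and_call.wdl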

Running WDL Workflows

WDL workflows require an execution engine to run. See our guide on WDL Execution Engines for details on:

  • Cromwell - Best for shared HPC systems with advanced features
  • miniWDL - Lightweight, easy local execution
  • Sprocket - Modern alternative with easy setup

At Fred Hutch, you can also use PROOF, a user-friendly interface for submitting WDL workflows to our cluster, or Cirro for cloud-based execution.

Utilizing the WILDS WDL Library

The WILDS WDL Library provides tested, reusable components:

version 1.0

import "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-sra/ww-sra.wdl" as sra_tasks
import "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-star/ww-star.wdl" as star_tasks

workflow RNAseqFromSRA {
  input {
    String sra_id
    File reference_genome
    File gene_annotations
  }

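  # Download raw FASTQ data from SRA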
  call sra_tasks.fastqdump {
    input: sra_id = sra_id
  }

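  # Align reads with STAR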
  call star_tasks.star_align {
    input:
      fastq = fastqdump.fastq,
      genome = reference_genome,
      annotations = gene_annotations
  }

  output {
    File aligned_bam = star_align.output_bam
    File gene_counts = star_align.counts
  }
}

This approach lets you build complex workflows quickly without writing every task from scratch. If there's a tool or workflow you think would be useful, feel free to file an issue in the WILDS WDL Library GitHub repo or reach out to us directly!

Getting Help

Fred Hutch Resources:

Broader Community:

Next Steps

  1. Learn the basics with Data Science Lab’s WDL course
  2. Explore examples in the WILDS WDL Library
  3. Choose an execution engine - see WDL Execution Engines
  4. Run your first workflow using PROOF or a local executor
  5. Join the community in #workflow-managers Slack
