Workflow Description Language (WDL)
The Workflow Description Language (WDL) is an open-source language for describing data processing workflows with human-readable syntax. WDL makes it straightforward to define analysis tasks, chain them together in workflows, and parallelize their execution across different computing environments.
Why Use WDL?
Easy to Read and Share
WDL’s design emphasizes clarity and simplicity:
- Human-readable syntax makes workflows easy to understand, review, and customize as necessary
- Standardized format enables sharing across institutions and research groups
- Reusable components through task libraries like the WILDS WDL Library
Easy to Execute
WDL separates the scientific logic (what to compute) from infrastructure details (where and how to compute):
- No need to write SLURM submission scripts or cloud deployment code
- Execution engines handle job scheduling, resource allocation, and data staging
- Focus on your science, not on system administration
Reproducibility
WDL workflows explicitly define every aspect of your analysis:
- Exact software versions through containerization (Docker/Apptainer)
- Deterministic execution that produces identical results across different platforms
- All required inputs and files clearly specified upfront
This means a workflow written today will produce the same results months or years later, regardless of changes to your computing environment.
Portability
Write once, run anywhere. WDL workflows can execute on:
- Local workstations
- Institutional HPC clusters (like Fred Hutch’s Gizmo)
- Cloud platforms (AWS, Google Cloud, Azure)
- Workflow platforms (Terra, DNAnexus, PROOF)
The same workflow file runs identically across all these environments without modification.
WDL Fundamentals
Let’s start with a high-level overview of WDL syntax (for more comprehensive instruction, see our online WDL course).
Structure Overview
A WDL workflow consists of three main components:
- Workflow: Defines the overall analysis pipeline and how tasks connect
- Tasks: Individual units of work (like running a specific tool)
- Inputs: Parameters and files needed to run the workflow
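To see how these three components fit together, here is a minimal but complete example (a hello-world sketch; the workflow, task, and container names are illustrative, not from the library):

version 1.0

# Workflow: connects tasks and declares the pipeline's inputs and outputs
workflow HelloWDL {
  input {
    String name   # Input: a parameter supplied via the input JSON
  }
  call SayHello {
    input: name = name
  }
  output {
    File greeting = SayHello.greeting
  }
}

# Task: a single unit of work
task SayHello {
  input {
    String name
  }
  command <<<
    echo "Hello, ~{name}!" > greeting.txt
  >>>
  output {
    File greeting = "greeting.txt"
  }
  runtime {
    docker: "ubuntu:22.04"
  }
}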
Anatomy of a Task
Here’s an example task that aligns sequencing reads using BWA:
task BwaMem {
  input {
    File input_fastq
    String base_file_name
    File ref_fasta
  }
  command <<<
    # Index reference if needed
    bwa index ~{ref_fasta}
    # Align reads
    bwa mem -p -v 3 -t 16 -M \
      ~{ref_fasta} ~{input_fastq} > ~{base_file_name}.sam
    # Convert to BAM
    samtools view -1bS -@ 15 -o ~{base_file_name}.aligned.bam ~{base_file_name}.sam
  >>>
  output {
    File output_bam = "~{base_file_name}.aligned.bam"
  }
  runtime {
    cpu: 16
    memory: "32 GB"
    docker: "getwilds/bwa:0.7.17"
  }
}
Key sections:
- `input`: Files and parameters the task needs
- `command`: Shell commands to execute (note the `~{variable}` syntax for variable interpolation)
- `output`: Files generated by the task that can be used by downstream tasks
- `runtime`: Computing resources and software environment (container) required
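Hardcoded resource values like the thread count above work, but they are often promoted to task inputs so the same task can scale across datasets. A sketch of that pattern (the `cpu_count` input is illustrative, not part of the task above):

task BwaMemFlexible {
  input {
    File input_fastq
    String base_file_name
    File ref_fasta
    Int cpu_count = 16   # default thread count; override via the input JSON
  }
  command <<<
    bwa index ~{ref_fasta}
    # Thread count now tracks the cpu_count input instead of a hardcoded 16
    bwa mem -p -v 3 -t ~{cpu_count} -M \
      ~{ref_fasta} ~{input_fastq} > ~{base_file_name}.sam
  >>>
  output {
    File output_sam = "~{base_file_name}.sam"
  }
  runtime {
    cpu: cpu_count
    memory: "32 GB"
    docker: "getwilds/bwa:0.7.17"
  }
}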
Anatomy of a Workflow
Workflows orchestrate multiple tasks:
version 1.0

workflow AlignAndCallVariants {
  input {
    File input_fastq
    File reference_genome
    String sample_name
  }
  # Align reads
  call BwaMem {
    input:
      input_fastq = input_fastq,
      ref_fasta = reference_genome,
      base_file_name = sample_name
  }
  # Call variants
  call HaplotypeCaller {
    input:
      input_bam = BwaMem.output_bam,
      ref_fasta = reference_genome,
      sample_name = sample_name
  }
  output {
    File aligned_bam = BwaMem.output_bam
    File variants_vcf = HaplotypeCaller.output_vcf
  }
}
Key features:
- Tasks are executed using `call` statements
- Task outputs are referenced as `TaskName.output_name` and passed as inputs to subsequent tasks
- Workflow inputs can be passed to multiple tasks
- Workflow outputs define which files to keep as final results
Providing Inputs
Inputs are typically provided via JSON files:
{
  "AlignAndCallVariants.input_fastq": "/path/to/sample1.fastq.gz",
  "AlignAndCallVariants.reference_genome": "/path/to/hg38.fa",
  "AlignAndCallVariants.sample_name": "patient_001"
}
This separation allows the same workflow to run on different datasets without modifying the WDL file.
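Note that any workflow input declared with a default value may be omitted from the JSON; only inputs without defaults are required. A minimal sketch (the workflow and input names are illustrative):

version 1.0

workflow ExampleDefaults {
  input {
    File input_fastq                    # required: must appear in the input JSON
    String sample_name = "sample_001"   # optional: a JSON entry overrides the default
    Int threads = 4                     # optional: falls back to 4 if omitted
  }
  output {
    String configured_name = sample_name
  }
}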
Containerization Strategies
Why Containers?
Containers ensure reproducibility by packaging software with all its dependencies:
- Eliminates “works on my machine” problems
- Guarantees consistent software versions
- Simplifies complex dependency management
- Makes workflows immediately runnable on new systems
Docker
Docker is the most common containerization platform for WDL workflows:
runtime {
  docker: "getwilds/star:2.7.6a"
}
The WILDS Docker Library provides pre-built, tested containers for common bioinformatics tools. For more on using Docker at Fred Hutch, see our Docker guide.
Apptainer (formerly Singularity)
On HPC clusters that don’t allow Docker (like Fred Hutch’s Gizmo), WDL execution engines can be configured to use Apptainer instead:
- You still specify `docker:` in your WDL runtime section
- The execution engine (when properly configured) converts Docker images to Apptainer format
- Your workflow code remains portable - the same WDL works with both Docker and Apptainer
- Note: This requires the execution engine to be set up with Apptainer support (platforms like PROOF handle this configuration for you)
For more details on using Apptainer at Fred Hutch, see our Apptainer guide.
Environment Modules (HPC-specific)
Some HPC systems use module systems (like EasyBuild/Lmod) for software management. At Fred Hutch, you can leverage our extensive module collection:
runtime {
  modules: "SAMtools/1.11-GCC-10.2.0"
}
However, this approach reduces portability since modules are institution-specific. For workflows you plan to share, containers are preferred. Learn more about environment modules and containers in our Computing Environments guide.
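If you want a single workflow to run both on module-based HPC backends and on container-backed engines, one pragmatic pattern is to declare both keys, since engines generally ignore runtime attributes their backend does not recognize (Cromwell warns and continues). A sketch; the image tag and module string here are illustrative:

runtime {
  cpu: 4
  memory: "8 GB"
  docker: "getwilds/samtools:1.11"      # used by Docker/Apptainer-backed engines
  modules: "SAMtools/1.11-GCC-10.2.0"   # used by module-aware HPC backends
}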
Advanced Features
Parallelization with Scatter-Gather
Process multiple samples or genomic regions in parallel:
workflow ProcessMultipleSamples {
  input {
    Array[File] sample_fastqs
    File reference_genome
  }
  scatter (fastq in sample_fastqs) {
    call AlignReads {
      input:
        input_fastq = fastq,
        ref_fasta = reference_genome
    }
  }
  output {
    Array[File] all_bams = AlignReads.output_bam
  }
}
The execution engine automatically parallelizes scattered tasks across available compute resources.
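The "gather" half of the pattern falls out naturally: outputs of a scattered call are collected into an `Array`, which downstream tasks can consume. A minimal sketch (the `MergeBams` task and its container tag are illustrative, not part of the workflow above):

task MergeBams {
  input {
    Array[File] bams
  }
  command <<<
    # ~{sep=" " bams} expands the gathered array into a space-separated list
    samtools merge merged.bam ~{sep=" " bams}
  >>>
  output {
    File merged_bam = "merged.bam"
  }
  runtime {
    cpu: 2
    memory: "4 GB"
    docker: "getwilds/samtools:1.11"
  }
}

Placed after the scatter block inside ProcessMultipleSamples, `call MergeBams { input: bams = AlignReads.output_bam }` would merge the per-sample BAMs into a single file.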
Conditional Execution
Run tasks only when certain conditions are met:
if (run_quality_control) {
  call FastQC {
    input: reads = input_fastq
  }
}
Task-level Retry Logic
Automatically retry tasks that fail due to transient issues:
runtime {
  maxRetries: 3
  docker: "getwilds/bwa:0.7.17"
}
Workflow Development Best Practices
Start Small, Scale Up
- Develop with small test datasets on a single sample
- Validate outputs are correct
- Scale to full datasets using scatter-gather
- Let the execution engine handle parallelization
Version Everything
- Use specific container tags (not `latest`)
- Specify the WDL version at the top of your file: `version 1.0`
- Track your workflow in version control (Git)
Modularize Your Code
- Break complex analyses into reusable tasks
- Import tasks from libraries like WILDS WDL Library
- One workflow per repository for easier sharing
Document Your Work
- Add comments explaining complex logic
- Include example input JSON files
- Provide a README with workflow purpose and requirements
Getting Started with WDL
Learning Resources
Official Documentation:
- OpenWDL Specification - Complete language reference
- OpenWDL Learn WDL - Tutorials and examples
- WDL Docs - Additional documentation and guides
Fred Hutch Resources:
- DaSL Developing WDL Workflows - Comprehensive course
- PROOF How To - User-friendly WDL execution platform
- PROOF Troubleshooting - Common issues and solutions using PROOF
- WILDS WDL Library - Ready-to-use tasks and workflows
- WILDS Docker Library - Pre-built containers for bioinformatics tools
Community Resources:
- WARP - Broad Institute’s WDL workflows
- GATK Workflows - Production genomics pipelines
- bioWDL - Bioinformatics workflow templates
Development Tools
VS Code Extension:
- WDL Syntax Highlighter - Syntax highlighting and error detection
- Sprocket Extension - Integrated WDL execution and debugging with Sprocket
Validation Tools:
- `sprocket lint` - Lint WDL files for best practices
- `miniwdl check` - Lint and validate WDL files
- `womtool validate` - Check WDL syntax without running
Running WDL Workflows
WDL workflows require an execution engine to run. See our guide on WDL Execution Engines for details on:
- Cromwell - Best for shared HPC systems with advanced features
- miniWDL - Lightweight, easy local execution
- Sprocket - Modern alternative with easy setup
At Fred Hutch, you can also use PROOF, a user-friendly interface for submitting WDL workflows to our cluster, or Cirro for cloud-based execution.
Utilizing the WILDS WDL Library
The WILDS WDL Library provides tested, reusable components:
version 1.0
import "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-sra/ww-sra.wdl" as sra_tasks
import "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-star/ww-star.wdl" as star_tasks
workflow RNAseqFromSRA {
  input {
    String sra_id
    File reference_genome
    File gene_annotations
  }
  call sra_tasks.fastqdump {
    input: sra_id = sra_id
  }
  call star_tasks.star_align {
    input:
      fastq = fastqdump.fastq,
      genome = reference_genome,
      annotations = gene_annotations
  }
  output {
    File aligned_bam = star_align.output_bam
    File gene_counts = star_align.counts
  }
}
This approach lets you build complex workflows quickly without writing every task from scratch. If there’s a tool/workflow that you think would be useful, feel free to file an issue in the WILDS WDL Library GitHub repo or reach out to us directly!
Getting Help
Fred Hutch Resources:
- WILDS Team - WDL workflow support
- #workflow-managers Slack - Community discussion
- Data House Calls - One-on-one consultations
Broader Community:
- OpenWDL Community - Language specification and governance
- Terra Support Forum - General WDL questions
- WDL Slack Channel - Community support
Next Steps
- Learn the basics with Data Science Lab’s WDL course
- Explore examples in the WILDS WDL Library
- Choose an execution engine - see WDL Execution Engines
- Run your first workflow using PROOF or a local executor
- Join the community in #workflow-managers Slack