Bioinformatics Tools for Microbiome Analysis

Updated: October 6, 2023

Edit this Page via GitHub       Comment by Filing an Issue      Have Questions? Ask them here.

Data generated from microbiome experiments tends to require a distinct analytical approach, one that takes into account the many different organisms which may be present within a single biological sample. The considerations of sample preparation also tend to be fairly specific to the microbiome, e.g. 16S rRNA gene amplicon sequencing tends to be used only for the purpose of performing taxonomic identification for a mixture of bacteria.

The Microbiome Research Initiative was started at Fred Hutch to support and provide a community for scientists researching the microbiome, and part of that effort includes some work to develop a relatively standardized toolkit of bioinformatic analysis tools which can be used by multiple investigators across the center. This page is provided to host a description of analysis tools that are available, and will be updated as more are made available.

The tools below happen to use Nextflow as a system for running reproducible and portable analytical workflows. See this documentation for more details on running Nextflow at Fred Hutch, as well as the docs for more details on Nextflow itself.

For any questions on the tools presented here, either in trying to get them running on your data, or if you would like access to additional functionality (or would like to offer your own utilities), please contact Sam Minot (sminot at fredhutch.org).

Bacterial Genome Annotation

In order to work with individual microbes, it is almost essential to have a map of the genes present in their genome. The computational tool for microbial genome annotation which is most compatible with submission to public repositories is the PGAP tool developed by NCBI (link).

There are many different ways to annotate a single bacterial genome with all of the genes it is predicted to contain. PGAP is one such tool which has been produced by NCBI, and therefore retains some element of authority. Compared to other tools like Prokka, PGAP takes a very long time to run (a couple of hours per genome), but this can be mitigated for large batches of samples by processing them all concurrently using Nextflow.

We have implemented a simple workflow which enables any researcher to run the PGAP genome annotation pipeline on their own collection of bacterial genomes.

In order to run this tool, you must assemble some basic metadata describing the genome, as well as the genome sequence itself in FASTA format. You can find details on those annotation files here.

Please see the Microbial Genome Assembly workflow below if you would like to assemble a genome from raw sequence reads.

All of the details on running the PGAP workflow can be found in the GitHub repository for that workflow: https://github.com/FredHutch/PGAP-nf

Ribosomal 16S Amplicon Analysis

One of the primary tools used by microbiome researchers to detect organisms present in a microbiome sample is 16S amplicon sequencing. This technique takes advantage of a highly conserved gene present in bacterial genomes which can be targeted by PCR with well-designed primers, and then processed with high-thoughput sequencing. The analysis of 16S datasets is a highly-developed analytical process with a long series of steps, and therefore is highly amenable to automation with a formalized workflow.

Dr. Jonathan Golob is a physician-scientist at the University of Michigan, and he has developed a highly accurate and effective workflow for analyzing 16S datasets. This workflow uses dada2 to identify exact sequence varients and ultimately performs taxonomic identification using a phylogenetic approach which is very much on the cutting edge of the 16S field (using pplacer for phylogenetic assignment).

See his GitHub repository for more about this at jgolob/maliampi/.

Example Run:

All input files are specified in a single file (manifest.csv) with columns used to identify the specimen, read__1, and read__2. You may also include a batch column to indicate which samples were processed for sequencing together.

set -e

ml nextflow

REF_FOLDER="s3://fh-ctr-public-reference-data/tool_specific_data/maliampi/ya16sdb_20190821/dedup/1200bp/named/filtered"

nextflow \
    run \
    jgolob/maliampi \
    --manifest manifest.csv \
    --repo_fasta $REF_FOLDER/seqs.fasta \
    --repo_si $REF_FOLDER/seq_info.csv \
    --email <EMAIL> \
    --output <OUTPUT_FOLDER> \
    -w <WORK_DIR> \
    -resume

See the MaLiAmPi documentation for more details on running the tool and interpreting the output.

Microbial Genome Assembly

One common task in microbiology is sequencing the genome of microbial isolates. With the advent of single-molecule long-read PacBio sequencing, it is now possible to routinely generate fully closed genome assemblies. To automate this process, we have implemented the UniCycler assembler in an easy-to-use workflow. This assembler provides the advantage of accommodating both short- and long-reads, performing hybrid assembly when both are provided.

This workflow can be found in this GitHub Repository: FredHutch/unicycler-nf/

Usage:

nextflow run fredhutch/unicycler-nf <ARGUMENTS>

Arguments:
  --sample_sheet       CSV file listing samples to analyze
  --output_folder      Folder to place outputs

Options:
  --short_reads        Sample sheet contains short read data (`short_R1` and `short_R2`)
  --long_reads         Sample sheet contains long read data (`long_reads`)
  --min_fasta_length   Minimum contig length (default: 100)
  --help               Display this message

Sample Sheet:
  The sample_sheet is a CSV with a header indicating which samples correspond to which files.
  The file must contain the column `name`, and `long_reads`, `short_R1`, `short_R2` as appropriate.

Microbial Genome Circularization

The tools used for microbial genome assembly have been changing rapidly with the advent of long-read sequencing. One of the tools which has emerged in this area is Circlator, a tool to take an existing linear genome assembly and turn it into a circular assembly using information from long-read sequencing. You can refer to the publication and the documentation for more details.

Because this tool has a number of dependencies, it may be helpful to use this tool with Nextflow, which executes each step in a Docker container with all of the required dependencies. The workflow for this tool can be found in this repository, which also has all of the necessary details for running the workflow.

The only data required to run the tool is (1) a genome assembly in FASTA format, and (2) a set of long reads in either FASTQ or BAM format.

Usage:

nextflow run FredHutch/circulator-nf <ARGUMENTS>

Required Arguments:
  --manifest            File containing the location of all input genomes and reads to process
  --output_folder       Folder to place analysis outputs

Manifest:

    The manifest is a comma-separated table (CSV) with three columns, name, fasta, and reads. For example,

    name,fasta,reads
    genomeA,assemblies/genomeA.fasta.gz,pacbio_reads/genomeA.fastq.gz
    genomeB,assemblies/genomeB.fasta.gz,pacbio_reads/genomeB.fastq.gz

Microbial Pan-Genome Analysis

Microbial researchers often need to compare multiple genomes in order to identify similarities and differences. The best tool available in the community for this analysis is the anvi’o software suite developed by the Meren Lab at the University of Chicago. The anvi’o software does many things, but we wanted to provide an easy point of entry with a workflow that imports a set of bacterial genomes into the anvi’o database format, and then launches a graphical viewer which allows the user to explore their pan-genome collection.

The guidance and instructions for running this tool can be found on the GitHub repository FredHutch/nf-anvio-pangenome.

Microbial RNAseq

One application of whole-genome shotgun sequencing (WGS) for microbiome research is the analysis of microbial mixtures on the basis of what microbes are present (DNA) or are transcriptionally active (RNA). To address this analytical need, we developed an analysis tool which takes a set of WGS input data and aligns it against a set of whole microbial genomes. With the orientation towards RNAseq, the tool takes a parameter --min_cov_pct which limits the analysis to those organisms which have greater than the specified level of coverage across the rRNA genes found in their genomes. Using just those organisms, the pipeline will then measure the depth of sequencing across all genes for all organisms across all samples, and provide those results to the user in the form of a set of CSV files.

This workflow can be found in this GitHub Repository: FredHutch/microbial-rnaseq

Example Run:

All input files are specified in a single file batchfile.csv which notes each sample with the columns name and fastq (fastq1 and fastq2 for paired-end data).

nextflow \
    run \
    fredhutch/microbial-rnaseq \
    --batchfile batchfile.csv \
    --host_genome "s3://fh-ctr-public-reference-data/tool_specific_data/microbial-rnaseq/2019-06-10/Homo_sapiens_assembly38.fasta.tar" \
    --database_folder "s3://fh-ctr-public-reference-data/tool_specific_data/microbial-rnaseq/2019-07-03/" \
    --database_prefix 2019-07-03-rnaseq-database \
    --min_cov_pct 90 \
    --output_folder results/ \
    --output_prefix 2019-05-08-test \
    -work-dir work/ \
    -resume

Viral Metagenomics

When studying viruses in the human microbiome, one fruitful approach can simply be to perform whole-genome shotgun sequencing (WGS) and then align all reads against a set of viruses from some reference database. This is an extremely parallelizable process, and therefore benefits heavily from execution systems like Nextflow which can distribute tasks to cloud computing services (like AWS).

This utility takes a set of input files (FASTQ format) and aligns them all against a set of viral genomes (specified by NCBI accession in a provided CSV file).

This workflow can be found in this GitHub Repository: FredHutch/nf-viral-metagenomics

Example Run:

<INPUT_DIRECTRY> contains the set of FASTQ files to analyze <OUTPUT_DIRECTRY> is the location where all outputs will be placed <VIRAL_GENOME_CSV> is a CSV with a column accession containing the set of viral genomes to align against (as NCBI Nucleotide accessions) <NAME_OF_OUTPUT_CSV> is the name of the output file to be placed in the <OUTPUT_DIRECTORY>

nextflow \
    run \
        FredHutch/nf-viral-metagenomics \
        --input_directory <INPUT_DIRECTORY> \
        --output_directory <OUTPUT_DIRECTORY> \
        --viral_genome_csv <VIRAL_GENOME_CSV> \
        --output_csv <NAME_OF_OUTPUT_CSV>

Updated: October 6, 2023

Edit this Page via GitHub       Comment by Filing an Issue      Have Questions? Ask them here.