Reference Genomes
Updated: August 4, 2023
When analyzing data generated by high-throughput genome sequencing instruments, one common task is comparing the resulting short sequence fragments against the reference genome of a known organism. Because multiple groups may study the same organism, a collection of commonly used reference files has been made available for high-performance computing on the on-campus SLURM cluster.
iGenomes
The iGenomes collection of reference genomes was developed by Illumina to provide
a canonical source of reference genome information.
While that project is no longer actively updated, the available reference files have
been copied to a folder which is accessible at /shared/biodata/reference/iGenomes/.
The structure of those files is described below, and moving forward additional
reference genome indices may be generated which follow the same format and structure.
File Structure
All files described below can be found within the directory /shared/biodata/reference/iGenomes/:
- The top-level folder in the iGenomes directory is named for the source organism: e.g. /shared/biodata/reference/iGenomes/Homo_sapiens;
- The second-level folder is named for the source of the annotation used for the reference genome: e.g. /shared/biodata/reference/iGenomes/Homo_sapiens/UCSC;
- The third-level folder is named for the version of the genome: e.g. /shared/biodata/reference/iGenomes/Homo_sapiens/UCSC/hg19;
- The fourth-level folders are Annotation and Sequence;
- The Annotation folder contains a folder Genes and may also contain SmallRNA and Variation. The exact files provided for each genome may vary, but Genes may contain genes.bed and genes.gtf to describe exon structure of genes, while SmallRNA may contain hairpin.fa and mature.fa to describe miRNAs;
- The Sequence folder contains the folders WholeGenomeFasta and Chromosomes, which contain the FASTA sequence of the reference genome, either in full or broken out by chromosome, respectively;
- The Sequence folder also contains a set of references which have been pre-compiled for various alignment / analysis algorithms, which may include:
  - BismarkIndex
  - BlastDB
  - Bowtie2Index
  - BowtieIndex
  - BWAIndex
  - MDSBowtieIndex
  - STARIndex: For STAR < 2.7.6a
  - STAR2Index: For STAR ≥ 2.7.6a
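The STAR version rule in the list above can be sketched in Python. This is a minimal illustration rather than a supported utility: the function names `parse_star_version` and `star_index_folder` are hypothetical, and the parser assumes version strings of the form `2.7.6a` (three numbers and an optional single letter).

```python
import re


def parse_star_version(version: str) -> tuple:
    """Split a STAR version such as '2.7.6a' into (2, 7, 6, 'a').

    Assumes the 'major.minor.patch[letter]' form; anything else raises.
    """
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)([a-z]?)", version.strip())
    if match is None:
        raise ValueError(f"Unrecognized STAR version: {version!r}")
    major, minor, patch, suffix = match.groups()
    return (int(major), int(minor), int(patch), suffix)


def star_index_folder(version: str) -> str:
    """Return the iGenomes folder name matching the given STAR version."""
    # STAR2Index is for STAR >= 2.7.6a; STARIndex is for older releases.
    if parse_star_version(version) >= (2, 7, 6, "a"):
        return "STAR2Index"
    return "STARIndex"


print(star_index_folder("2.7.10b"))  # STAR2Index
print(star_index_folder("2.5.2b"))   # STARIndex
```

Comparing tuples keeps the numeric parts numeric, so `2.7.10b` correctly sorts after `2.7.6a`, which a plain string comparison would get wrong.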
Example
Below is an example of the folders available for the UCSC human genome GRCh37/hg19 reference:
├── Homo_sapiens                            # Organism
│   └── UCSC                                # Source
│       ├── hg19                            # Version
│       │   ├── Annotation
│       │   │   ├── Genes                   # Gene annotations
│       │   │   ├── README.txt
│       │   │   ├── SmallRNA                # Small RNA annotations
│       │   │   └── Variation               # Variant annotations
│       │   └── Sequence
│       │       ├── AbundantSequences
│       │       ├── BismarkIndex            # Index for Bismark
│       │       ├── BlastDB                 # Index for BLAST
│       │       ├── Bowtie2Index            # Index for Bowtie2
│       │       ├── BowtieIndex             # Index for Bowtie
│       │       ├── BWAIndex                # Index for BWA
│       │       ├── Chromosomes             # Genome sequence by chromosome
│       │       ├── MDSBowtieIndex          # Index for MDS Bowtie
│       │       ├── STARIndex               # Index for STAR < 2.7.6a
│       │       ├── STAR2Index              # Index for STAR ≥ 2.7.6a
│       │       └── WholeGenomeFasta        # Genome sequence
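Because every genome follows the same organism / source / version layout, a path into the collection can be assembled programmatically. The sketch below is only an illustration: the helper name `igenomes_path` is hypothetical, and it does not check that the resulting folder actually exists for a given genome.

```python
from pathlib import Path

# Root of the shared iGenomes copy described on this page.
IGENOMES_ROOT = Path("/shared/biodata/reference/iGenomes")


def igenomes_path(organism, source, version, category, resource,
                  root=IGENOMES_ROOT) -> Path:
    """Build <root>/<organism>/<source>/<version>/<category>/<resource>."""
    # The fourth-level folders are Annotation and Sequence.
    if category not in ("Annotation", "Sequence"):
        raise ValueError("category must be 'Annotation' or 'Sequence'")
    return Path(root) / organism / source / version / category / resource


bowtie2_dir = igenomes_path("Homo_sapiens", "UCSC", "hg19",
                            "Sequence", "Bowtie2Index")
print(bowtie2_dir)
```

Since the exact files provided for each genome may vary, it is worth calling `bowtie2_dir.is_dir()` before passing the path to an aligner.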
RNA-Fusion Reference Database
In addition to the iGenomes references described above, the /shared/biodata/ volume also
hosts the reference files needed for the nf-core/rnafusion analysis workflow.
The reference database was downloaded with version 1.2.0 of nf-core/rnafusion
on January 18, 2022 to
/shared/biodata/reference/nfcore/rnafusion/1.2.0/2022-01-18.
The content of that folder includes:
├── arriba/ # Reference database for the arriba tool
├── Homo_sapiens.GRCh38_r97.all.fa # Human reference genome sequence
├── Homo_sapiens.GRCh38_r97.cdna.all.fa.gz # Human reference transcript sequence
├── Homo_sapiens.GRCh38_r97.gtf # Human reference genome annotations
├── pipeline_info/ # Reference database download report
└── star-fusion/ # Reference database for the STAR-Fusion tool
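Before launching a workflow, it can be useful to confirm that the entries listed above are visible from the node you are on. The following is a minimal sketch; the function name `missing_entries` is hypothetical, and the check only tests for presence, not for file integrity.

```python
from pathlib import Path

# Location and contents as listed on this page.
RNAFUSION_REF = Path("/shared/biodata/reference/nfcore/rnafusion/1.2.0/2022-01-18")
EXPECTED_ENTRIES = [
    "arriba",
    "Homo_sapiens.GRCh38_r97.all.fa",
    "Homo_sapiens.GRCh38_r97.cdna.all.fa.gz",
    "Homo_sapiens.GRCh38_r97.gtf",
    "pipeline_info",
    "star-fusion",
]


def missing_entries(ref_dir=RNAFUSION_REF, expected=EXPECTED_ENTRIES):
    """Return the expected file or folder names not found in ref_dir."""
    ref_dir = Path(ref_dir)
    return [name for name in expected if not (ref_dir / name).exists()]


missing = missing_entries()
if missing:
    print("Missing reference entries:", ", ".join(missing))
else:
    print("All expected reference entries are present.")
```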
Kraken2
Kraken2 is a taxonomic classification tool which is used to identify the microbes present in a complex mixture from whole-genome shotgun sequencing data. As the reference databases needed to run this tool can be quite laborious to build, a public collection of reference databases can be found at:
/shared/biodata/microbiome/kraken2
Note: This database location is automatically set in the environment as
KRAKEN2_DB_PATH when loading the Kraken2-2.0.7-beta-foss-2016b-Perl-5.28.0 module.
Additional databases can be downloaded as needed from the Langmead Lab website.
CellRanger
The software produced by 10X Genomics for the analysis of single-cell sequencing data is called CellRanger. A set of reference databases for the analysis of 10X single-cell data is available at:
/shared/biodata/ngs/Reference/10X
Note: This database location is automatically set in the environment as
TENX_REFDATA when loading the CellRanger-4.0.0.eb module.
SpaceRanger
SpaceRanger is a software suite developed by 10X Genomics (similar to CellRanger)
for the analysis of Visium Spatial Gene Expression data.
This tool can be used on the rhino/gizmo cluster by loading the
SpaceRanger-1.3.0-GCC-10.2.0.eb module.
A set of reference databases for the analysis of 10X Visium data is available at:
- Human: /shared/biodata/ngs/Reference/10x/refdata-gex-GRCh38-2020-A
- Mouse: /shared/biodata/ngs/Reference/10x/refdata-gex-mm10-2020-A
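In analysis scripts, the two Visium references above can be selected by species name instead of being hard-coded at each call site. This is a small sketch; the names `VISIUM_REFERENCES` and `visium_reference` are hypothetical and not part of SpaceRanger.

```python
from pathlib import Path

# Reference locations as listed on this page.
VISIUM_REFERENCES = {
    "human": Path("/shared/biodata/ngs/Reference/10x/refdata-gex-GRCh38-2020-A"),
    "mouse": Path("/shared/biodata/ngs/Reference/10x/refdata-gex-mm10-2020-A"),
}


def visium_reference(species: str) -> Path:
    """Return the shared reference folder for 'human' or 'mouse'."""
    key = species.strip().lower()
    if key not in VISIUM_REFERENCES:
        known = ", ".join(sorted(VISIUM_REFERENCES))
        raise ValueError(f"No shared reference for {species!r}; known: {known}")
    return VISIUM_REFERENCES[key]


print(visium_reference("Mouse"))
```

The returned path is what would typically be supplied as the `--transcriptome` argument of `spaceranger count`.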
ANNOVAR
ANNOVAR is a software tool for the annotation of genomic variants. Reference databases for this tool can be found at:
/shared/biodata/humandb
GTDB-Tk
GTDB-Tk is a software toolkit for
assigning objective taxonomic classifications to bacterial and archaeal genomes based
on the Genome Taxonomy Database (GTDB).
This tool can be used on the rhino/gizmo cluster by loading the
GTDBTk-0.1.3-foss-2016b-Python-3.6.7.eb module.
Reference databases for this tool can be found at:
/shared/biodata/humand/release86
AlphaFold
AlphaFold is a powerful tool for predicting
protein structures from primary amino acid sequences.
This tool can be used on the rhino/gizmo cluster by loading the AlphaFold-2.1.1 module.
When using this tool, the environment variable ALPHAFOLD_DATA_DIR is set
appropriately to reference the database files available at:
/shared/biodata/ngs/Reference/protein
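A script that wraps AlphaFold can read the database location from the environment and fall back to the shared path when the variable is not set, for example when the module has not been loaded. The helper name `alphafold_data_dir` is hypothetical, and the fallback behavior is an assumption made for this sketch, not something the module itself provides.

```python
import os
from pathlib import Path

# Shared database location listed on this page, used as the fallback.
DEFAULT_ALPHAFOLD_DATA = Path("/shared/biodata/ngs/Reference/protein")


def alphafold_data_dir(environ=None) -> Path:
    """Return ALPHAFOLD_DATA_DIR if set and non-empty, else the shared path."""
    environ = os.environ if environ is None else environ
    value = environ.get("ALPHAFOLD_DATA_DIR", "").strip()
    return Path(value) if value else DEFAULT_ALPHAFOLD_DATA


print(alphafold_data_dir())
```

Passing a dictionary as `environ` makes the function easy to exercise without modifying the real environment.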
CTAT
STAR-Fusion is a component of the Trinity Cancer Transcriptome Analysis Toolkit (CTAT). STAR-Fusion uses the STAR aligner to identify candidate fusion transcripts supported by Illumina reads, then further processes the STAR output to map junction reads and spanning reads to a reference annotation set. For human fusion transcript detection, STAR-Fusion uses a CTAT genome lib; alternatively, users can build their own reference library. The CTAT databases were downloaded from the Broad Institute and are available at:
/fh/scratch/app/CTAT
IRIS
IRIS stands for Isoform peptides from RNA splicing for Immunotherapy target Screening. The IRIS reference data was downloaded from IRIS Data and is available at:
/fh/scratch/app/IRIS/
Ongoing Support
To maintain the utility of this data resource, the Bioinformatics Core and the Data Core will incrementally add reference genomes for additional organisms and alignment algorithms. If your research group is interested in using an organism or alignment algorithm which is not currently available, please get in touch.