The Data Core is one of the Fred Hutchinson Cancer Center’s Shared Resources, a centralized program of specialized cores and scientific services that supports the research process. The Data Core coordinates with other core facilities at Fred Hutch to support researchers who use large-scale datasets generated by those facilities, such as genome sequencing, mass spectrometry, high-throughput imaging, electron microscopy, and flow cytometry.
While researchers’ needs vary widely across technical domains, scientific goals, and computational complexity, the Data Core provides broad support by developing technological solutions that researchers can apply directly. These solutions center on three areas: data atlases for publishing complex datasets to a broad audience, a web-based data portal for large-scale data management and analysis, and bioinformatics workflows that help automate complex transformations of scientific data using high-performance computing systems.
Cirro Data Portal
Managing large-scale datasets produced by technical instrumentation, such as genome sequencers, flow cytometers, and microscopes, can be a challenge for researchers. Such datasets often consist of many large files that need to be shared among collaborators and that can often be analyzed further to produce new results.
To help address these challenges, the Data Core has developed the Cirro data portal. Cirro enables researchers to efficiently manage research data, especially the data generated from the Shared Resources core facilities. It offers:
- Cost-effective storage for large datasets
- Secure data sharing among collaborators
- Standardized bioinformatics analysis pipelines
Fred Hutch researchers can log in directly to Cirro at cirro.bio.
An in-depth explanation of how Cirro can be used at Fred Hutch was provided as a seminar on July 11, 2023, and a recording of that presentation can be found on the Cirro CenterNet page.
Walkthrough videos on Cirro are provided for:
Accessing Cirro from the Command Line
The Cirro command-line interface (CLI) is available on the Fred Hutch computing
cluster as a pre-built module
that can be loaded in a rhino/gizmo session.
This can make it easier to download and work with files produced by
a Shared Resources Core Facility using the Fred Hutch computing cluster.
To install the Cirro command-line client on your local system, use
pip install cirro (requires Python).
More extensive documentation on the Cirro CLI can be found in the Cirro Documentation.
Downloading Files from Cirro
To download the files in a Cirro dataset to a local working directory:
# Load the Cirro CLI module on a rhino/gizmo session
# Interactive prompt to select the dataset to download
cirro-cli download -i
Uploading Files to Cirro
To upload a local folder of files as a dataset to Cirro:
# Load the Cirro CLI module on a rhino/gizmo session
# Interactive prompt to select the folder to upload, and
# the destination Cirro project / dataset
cirro-cli upload -i
When uploading paired-end FASTQ datasets (e.g. from sequencing
RNA or DNA samples), it is extremely helpful to provide a samplesheet
listing any metadata available for that batch of samples.
To annotate a dataset of FASTQ files with that sample-level metadata
in Cirro, place a file named
samplesheet.csv in the same folder as those FASTQ files,
with the format below (using the example of a dataset with two sample groups,
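As an illustration only, a samplesheet for a folder of paired-end FASTQ files could be drafted with a small shell function like the one below. This is a sketch under stated assumptions, not the format Cirro requires: the file naming pattern (`<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz`), the function name `make_samplesheet`, and the column names `sample`, `fastq_1`, `fastq_2`, and `group` are all hypothetical placeholders. Check the Cirro documentation for the expected columns before uploading.

```shell
# Sketch: draft a samplesheet.csv for paired-end FASTQ files in one folder.
# ASSUMPTIONS (hypothetical, not taken from the Cirro documentation):
#   - files are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz
#   - the column names sample, fastq_1, fastq_2, group are placeholders
make_samplesheet() {
    dir="$1"
    out="${dir}/samplesheet.csv"

    # Header row; the "group" column is left empty for manual annotation
    printf '%s\n' "sample,fastq_1,fastq_2,group" > "$out"

    for r1 in "$dir"/*_R1.fastq.gz; do
        # Skip the unexpanded pattern when no files match
        [ -e "$r1" ] || continue

        # Derive the expected mate file from the R1 file name
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        if [ ! -e "$r2" ]; then
            printf 'warning: no R2 mate found for %s\n' "$r1" >&2
            continue
        fi

        sample="$(basename "$r1" _R1.fastq.gz)"
        printf '%s,%s,%s,\n' "$sample" "$(basename "$r1")" "$(basename "$r2")" >> "$out"
    done
}
```

Calling `make_samplesheet ./my_fastq_folder` (a hypothetical path) writes one row per complete R1/R2 pair and warns on standard error about unpaired files. The empty last column can then be filled in by hand with whatever sample-level metadata is available.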
Adding Custom Workflows (WDL / Nextflow)
In addition to a catalog of pre-configured workflows, Cirro can be used to run custom WDL or Nextflow workflows. The code for those workflows can be pulled from public or private GitHub repositories, including official repositories from projects like GATK.
Guidance on adding an existing WDL or Nextflow workflow to Cirro can be found in the Cirro documentation.
Need help using the Cirro platform for data management and analysis? Drop-in office hours with the Cirro team are held every Tuesday at 2pm on Teams (Cirro Office Hours).
To help present the results of complex, large-scale research projects to a general audience, the Data Core and the Data Visualization Center have developed a number of data atlases that accompany the publication of that work in peer-reviewed journals. Presenting large datasets in visual terms conveys complex results in a more intuitive way.
A current list of data atlases can be found at viz.fredhutch.org/projects/.
If you are affiliated with The Fred Hutch / University of Washington / Seattle Children’s Cancer Consortium and have an interest in using this technology for your research, we would love to hear from you.
The bioinformatic process of analyzing large datasets often requires a series of computational steps, each of which may require a different set of software dependencies. To help coordinate and streamline the execution of these workflows, researchers around the world have started to adopt a set of software tools for workflow management, such as Nextflow, Cromwell, and Snakemake.
One of the ways in which the Data Core works to provide support for bioinformatic analysis is by helping to put these workflow management tools directly into the hands of Fred Hutch researchers.
This includes assistance with running computational workflows on different computational resources (individual computers, the on-premises HPC cluster, or the cloud), curation of pre-existing workflows for commonly used tasks, and assistance with the development of novel workflows to implement new scientific ideas.
Our Workflow Resources include:
- Guidance for running automated workflows on Fred Hutch HPC resources (SLURM and AWS)
- A catalog of curated bioinformatics workflows (e.g., RNAseq, pan-genome analysis)
- Building your own automated workflows (e.g., from existing Bash scripts)
If you have any questions about using automated workflows for your research, please don’t hesitate to get in touch.
The process of analyzing datasets generated for a particular experiment or project can be complex, often requiring deep expertise in the technology used to generate the raw data as well as the computational tools needed to process them. The Bioinformatics Core provides researchers with support for this analysis, engaging on the basis of specific projects.
The staff of the Bioinformatics Core are available by appointment for one-on-one consultation. They support researchers with experimental design, direct analysis of complex datasets, consultation on the choice of data analysis strategies and software tools, and advice and troubleshooting as you conduct your own analyses.
We strongly encourage researchers to consult with a bioinformatics specialist at the earliest stages of a project to ensure an appropriate experimental design is in place prior to seeking data analysis support.
While there are many resources available online for building skills in computational analysis of complex datasets, it can often be difficult for researchers to know where to start or which approaches will be the most useful. To help provide some structure for researcher-driven skills development, we work to provide a useful compendium of self-directed training resources.
For more information, browse our information resources for learning opportunities.
The Data Core and Bioinformatics Core maintain a core set of data resources which are accessible to the entire Fred Hutch community. This collection includes frequently used reference genomes that are available for high-performance computing on the shared file system. If you have suggestions for additional data resources that could benefit multiple research groups, please contact the Data Core.
For more information, browse our documentation of the iGenomes reference genomes hosted in
Updated: October 13, 2023