Data Sharing

This page provides guidance on sharing research data for publication or as required by funding agencies. Many journals have data and code availability requirements under which the data must be available and accessible to readers upon publication.

For information about the 2023 NIH Data Management and Sharing Policy, see NIH Data Sharing.

Choosing a Data Repository (NIH Guidance)

The type of data, your funder’s requirements, and your field of research will all influence which repository is right for your project.

Selecting a Data Repository - Guidelines for choosing the appropriate repository based on your data type, discipline, and funder requirements
Data management and sharing policies - Information on NIH data sharing policies and procedures for accessing shared scientific data
NIH-Supported Data Sharing Resources - A curated list of domain-specific (e.g., dbGaP, GTEx) and generalist (e.g., Zenodo, Figshare, Dryad) repositories
NCI Cancer Research Data Commons (CRDC) - NCI-funded researchers are encouraged to share their data through the CRDC, in line with NIH’s Data Management and Sharing Policy

Common Data Repositories

cBioPortal - A platform for viewing and accessing cancer genomics data
- Note: Fred Hutch has its own instance of cBioPortal
dbGaP - NIH’s database of Genotypes and Phenotypes, offering both public and controlled-access individual-level genomic data
GEO - Gene Expression Omnibus, a public functional genomics data repository for array- and sequence-based data
TCGA - The Cancer Genome Atlas, providing molecular characterization of approximately 20,000 primary cancers across 33 cancer types
GTEx - The Genotype-Tissue Expression project, studying tissue and cell-specific gene expression and regulation
1000 Genomes - A resource for genetic variants in human populations
gnomAD - The Genome Aggregation Database, aggregating and harmonizing exome and genome sequencing data across multiple studies
TOPMed - Trans-Omics for Precision Medicine, an NIH/NHLBI program focused on heart, lung, blood, and sleep (HLBS) disorders
dbSNP - Database of single nucleotide variations, microsatellites, and small-scale insertions and deletions, along with population frequency and other information
UK Biobank - A prospective cohort study with genetic and health data on approximately 500,000 participants
Sage’s Synapse.org - Platform for sharing research data privately or publicly, hosting several open datasets and DREAM Challenges

dbGaP Specific Guidance

See the dbGaP Study Submission Guide

The NIH is committed to respecting the privacy and intentions of research participants. Data access is intended only for scientific investigators pursuing research questions consistent with informed consent agreements. Investigators must utilize appropriate controls and abide by Data Use Limitations.

NIH repositories like dbGaP provide two access levels:

Public Access: Aggregate (non-individual-level) genomic data can be publicly accessed through repository websites.
Controlled Access: Individual-level data submitted to NIH repositories must be de-identified (no names or other identifying information). However, genotype data carries an embedded genetic fingerprint and cannot be fully de-identified, so all individual-level data is distributed only through the NIH Authorized Access System.

Genomic data requires special considerations due to its personal nature and unique characteristics. Genomic data:

Is often stored indefinitely
Changes in relevance over time
Carries uncertain risks
Raises privacy concerns due to re-identification risks
Can reveal unexpected health susceptibilities
Has implications for family members and reproductive decisions

Please consult with the appropriate administrative authority (e.g. an IRB) before submitting or accessing controlled-access data.

When sharing genomic and phenotypic data, investigators should:

Use informed consent documents with appropriate language regarding data sharing and future use
Share de-identified data by default
Use requested datasets solely for the research project described in the approved data request or protocol
Make no attempt to identify or contact individual participants without appropriate IRB approvals
Not distribute data to any entity or individual beyond those specified in the approved data request or protocol
Strive for harmonization of data collection and archiving methods to ensure scientific quality and validation
Adhere to computer security practices that ensure only authorized individuals can access data files and otherwise meet institutional security requirements
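
As an illustration of "share de-identified data by default," the following is a minimal sketch of stripping direct identifiers from tabular phenotype records before deposit. The function name `deidentify`, the `DIRECT_IDENTIFIERS` column names, and the `study_id` field are all hypothetical and are not part of any NIH or dbGaP tooling. Dropping these columns alone does not guarantee that a dataset meets regulatory or repository de-identification standards, and genotype data still requires controlled access as described above.

```python
import secrets

# Hypothetical column names; replace with the direct identifiers in your dataset.
DIRECT_IDENTIFIERS = {"name", "mrn", "date_of_birth", "email"}


def deidentify(records, id_field="participant_id"):
    """Return (shareable_records, linking_key).

    Drops direct identifier fields and replaces each participant ID with a
    random study ID. The linking key maps original IDs to study IDs; it is
    meant to be stored separately and never shared.
    """
    linking_key = {}
    shareable = []
    for record in records:
        original_id = record[id_field]
        # Reuse the same study ID for repeated rows from one participant.
        if original_id not in linking_key:
            linking_key[original_id] = "S-" + secrets.token_hex(8)
        cleaned = {
            field: value
            for field, value in record.items()
            if field not in DIRECT_IDENTIFIERS and field != id_field
        }
        cleaned["study_id"] = linking_key[original_id]
        shareable.append(cleaned)
    return shareable, linking_key


# Example with made-up records.
records = [
    {"participant_id": "P001", "name": "Jane Doe", "age": 54, "diagnosis": "AML"},
    {"participant_id": "P001", "name": "Jane Doe", "age": 55, "diagnosis": "AML"},
    {"participant_id": "P002", "name": "John Roe", "age": 61, "diagnosis": "CLL"},
]
shared, key = deidentify(records)
```

Random study IDs are used here rather than hashes of the original IDs, since a hash of a short, guessable identifier can be reversed by brute force. Whether the linking key may be retained at all depends on the consent documents and the approved protocol.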

Questions?

Consortium members (Fred Hutch, UW, Children’s) can schedule Data House Calls for support:

Governance questions around research data sharing: AI and Research Data Policy Data House Call

Technical questions around research data sharing: Research Computing and Data Management Data House Call
