Data Sharing

Updated: October 6, 2023

More to come here. We will develop and share guidance and information about useful tools, approaches, software, and strategies for sharing data, whether sharing is required by a funding source or a publication or arises in the course of your research with collaborators.

Data sharing in genomics and other large-scale datasets has highlighted specific new challenges and possibilities. Sharing large-scale research data has the potential to strengthen academic medical research, the practice of medicine, and the integrity of the clinical trial system. Some benefits are obvious: when researchers have access to complete data, they can answer new questions, explore different lines of analysis, and more efficiently conduct large-scale analyses across trials or projects. However, our evolving collective understanding of data sharing practices for large-scale datasets can result in an unnecessary burden on the research and the researchers, one that is counterproductive and may not make the patient or researcher any safer. This section can help guide decisions and actions for sharing and managing research data in a way that keeps the original research as productive as possible while meeting the data privacy and security needs of those involved.

Data Repositories

Public Repositories

Here we provide some details on a few public repositories; the list is by no means exhaustive. If you know of another repository that you would like to see added to this page, please contact us. Some of these repositories require IRB approval before data can be downloaded. Please correspond with the proper administrative authority.

  • GTEx: The Genotype-Tissue Expression project aims to understand how gene expression and its regulation differ across 53 different tissues.
  • TCGA: The Cancer Genome Atlas is a joint effort by the NCI and the NHGRI to molecularly characterize approximately 20,000 primary cancer samples, in comparison to matched normal samples, across 33 cancer types.
  • 1000 Genomes: The 1000 Genomes Project sequenced 1000 genomes to find genetic variants that have frequencies of at least 1% in the human population. A useful source
  • GEO: The Gene Expression Omnibus is a public genomics data repository.
  • gnomAD: The Genome Aggregation Database is working to aggregate and harmonize sequencing data across multiple studies.
  • TOPMed: Trans-Omics for Precision Medicine (TOPMed), sponsored by the NHLBI at the NIH, aims to improve understanding of the fundamental biological processes that underlie heart, lung, blood, and sleep (HLBS) disorders.
  • dbSNP: Provides information about single nucleotide polymorphisms (SNPs) and tracks their publication history.
  • UK Biobank: Prospective cohort study with a wealth of genetic and health data on 500,000 participants.
  • cBioPortal: This is an excellent resource for viewing/accessing TCGA cancer genomics data.
  • Sage’s platform is useful not only for sharing your own research data (privately or publicly); it also hosts several open datasets and lets you interact with the various DREAM Challenges they host.

Genomic Data Sharing

As noted by the NIH and others, the nature of genomic data requires several specific considerations be kept in mind. In addition to being personal and unique to each individual, genomic data may, for example:

  • Be stored and used indefinitely.

  • Inform individuals about susceptibility to a broad range of conditions (some of which are unexpected given personal or family history).

  • Carry with them risks that are uncertain or unclear.

  • Be reinterpreted and change in relevance over time.

  • Raise privacy concerns (in part because of the risk of re-identification).

  • Be relevant for family members and reproductive decision-making.

  • Transform the personal decision of sharing genetic data into one with current and future ramifications for family members.

There are many challenges to sharing phenotypic and genotypic data. As such, some genomic data sharing best practices are listed below:

  • Investigator(s) should use requested datasets solely in connection with the research project described in the approved data request, protocol or other vehicle that bounds the use and disclosure of the data.

  • Investigator(s) will make no attempt to identify or contact individual participants from whom these data were collected without appropriate approvals from the relevant IRBs.

  • Investigator(s) will use informed consent documents with specific language addressing the potential persistent harms that identified data can carry, the rights of patients to withdraw consent, the type of data involved, and the respect and trust owed to participants who give researchers the right to use their data and samples for “future unspecified research.”

  • Investigator(s) will not distribute these data to any entity or individual beyond those specified in the approved data request, protocol or other vehicle which bounds the use and disclosure of the data.

  • Investigator(s) will by default share de-identified data.

  • Investigator(s) will strive to harmonize data collection and archiving (storage) methods, tools, and representation so that scientific quality can be validated.

  • Investigator(s) will adhere to computer security practices that ensure only authorized individuals can gain access to data files, minimize unintended access, and otherwise adhere to institutional security requirements.
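
As a minimal sketch of the "share de-identified data by default" practice above, the following Python removes direct identifiers from tabular records and replaces the participant ID with a keyed hash. The column names (`participant_id`, `name`, `date_of_birth`, `mrn`), the helper functions `pseudonymize` and `deidentify`, and the way the secret key is supplied are all assumptions made for illustration; they are not an institutional standard. Removing these fields does not by itself make genotype data non-identifying.

```python
import hashlib
import hmac

# Hypothetical direct-identifier columns; adjust to your own data dictionary.
DIRECT_IDENTIFIERS = {"name", "date_of_birth", "mrn"}


def pseudonymize(participant_id: str, secret_key: bytes) -> str:
    """Return a stable keyed hash (HMAC-SHA256, truncated) of a participant ID."""
    digest = hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def deidentify(records, secret_key, id_field="participant_id"):
    """Drop direct identifiers and replace the ID field with a pseudonym.

    Returns new dictionaries; the input records are left unchanged.
    """
    cleaned = []
    for record in records:
        row = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        row[id_field] = pseudonymize(str(record[id_field]), secret_key)
        cleaned.append(row)
    return cleaned


if __name__ == "__main__":
    # Assumption: the key is stored separately from any shared files.
    key = b"replace-with-a-secret-kept-outside-the-shared-data"
    rows = [
        {"participant_id": "P001", "name": "Jane Doe",
         "date_of_birth": "1970-01-01", "mrn": "12345",
         "tissue": "lung", "expression": 7.2},
    ]
    for row in deidentify(rows, key):
        print(row)
```

Keeping the key apart from the shared files means the same participant maps to the same pseudonym across releases, while recipients cannot recover the original ID by hashing guessed values.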

NIH Public Genomic Data Repositories

NIH is committed to respecting the privacy and intentions of research participants with regard to how their individual data are used. Data access is therefore intended only for scientific investigators pursuing research questions that are consistent with the informed consent agreements provided by individual research participants. Furthermore, investigators granted access are expected to use appropriate controls and abide by the applicable Data Use Limitations.

dbGaP, for example, has two repositories into which researchers can deposit, and from which they can retrieve, de-identified data:

  • Public Access: Non-individual (aggregate) genomic data can be publicly accessed through the dbGaP website.

  • Controlled Access (Individual-level data): Individual-level data submitted to dbGaP must be de-identified; no names or other identifying information are attached to the data. However, an individual’s genotype data carries a genetic fingerprint that cannot be removed. For this reason, to protect individuals’ privacy, all individual-level data is distributed only through the NIH Authorized Access System.
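
The two access levels described above can be summarized as a simple routing rule. This is a sketch only: the `Dataset` class, its fields, and the `access_tier` function are hypothetical names chosen for illustration and do not correspond to any dbGaP API.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    individual_level: bool  # True if rows describe single participants
    deidentified: bool      # True if names and other identifiers were removed


def access_tier(dataset: Dataset) -> str:
    """Map a dataset to the access level described in the text."""
    if not dataset.individual_level:
        # Aggregate (non-individual) data can be publicly accessed.
        return "public"
    if not dataset.deidentified:
        # Individual-level submissions are required to be de-identified.
        raise ValueError(
            f"{dataset.name}: individual-level data must be de-identified"
        )
    # De-identified individual-level data still carries a genetic
    # fingerprint, so it goes through controlled access.
    return "controlled"


if __name__ == "__main__":
    print(access_tier(Dataset("summary_stats", False, True)))
    print(access_tier(Dataset("genotypes", True, True)))
```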

Additional information regarding the process of depositing data into NIH Public Repositories can be found from NCBI and from the NCI.
