Data Sharing and Public Repositories

Updated: April 17, 2020

Whether required for funding and publication or desired for its beneficial impact on research progress, understanding the why and how of data sharing is essential for modern biomedical research programs.

Benefits of Data Sharing

Some of the benefits of data sharing include:

  • Reinforcing open scientific inquiry

  • Encouraging diversity of analysis and opinion

  • Promoting new research

  • Expanding testing capabilities of both new and alternative hypotheses and methods of analysis

  • Supporting studies on data collection methods and measurement

  • Facilitating education of new researchers

  • Enabling the exploration of topics not envisioned by the initial investigators

  • Permitting the creation of new data sets by combining data from multiple sources

  • Giving individual investigators additional insights from other investigators’ studies of the data

  • Increasing the visibility and credibility of one's own research based on the data

  • Opening opportunities for developing new collaborations and for access to complementary data sets

Data Sharing and Management Plans

A good resource for developing a data sharing and management plan is DMPTool, sponsored by the University of California. DMPTool lets data owners create, review, and share data management plans, and it provides supporting guidance on sharing and management.

The following should be considered when developing a data sharing plan:

  1. Data Description: What data will be generated? How will you create the data (simulated, observed, experimental, software, physical collections)?

  2. Existing Data: Will you be using existing data? What is the relationship between the data you are collecting and existing data?

  3. Audience: Who will potentially use the data?

  4. Confidentiality: What coding, systems, and gatekeeping procedures are in place to protect the confidentiality of the data subjects?

  5. Access and Sharing: How will data files be shared? How will others access them?

  6. Formats: What data formats will you be creating?

  7. Metadata and Documentation: What documentation will you provide to describe the data? Metadata formats and standards?

  8. Storage, backup, replication, versioning: Are the data files backed up regularly? Are there replicas in different locations? Are older versions of the data kept?

  9. Security: Are the system and storage that will be used secure?

  10. Budget: Any costs for preparing the data? Costs for storage and long-term access?

  11. Privacy, Intellectual Property: Does the data contain private or confidential information? Any copyrights?

  12. Archiving, Preservation, Long-term Access: What plans do you have to archive the data and other research products? Will it have long-term accessibility?

  13. Adherence: How will you check for adherence to this plan?
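The checklist above can be tracked programmatically while a plan is being drafted. The following is a minimal sketch, not part of DMPTool or any funder's tooling: the section names come from the list above, while the `DataSharingPlan` class, the `missing_sections` method, and the sample answers are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field

# Section names taken from the numbered list above; the class and
# method names are hypothetical and only serve this illustration.
PLAN_SECTIONS = [
    "Data Description",
    "Existing Data",
    "Audience",
    "Confidentiality",
    "Access and Sharing",
    "Formats",
    "Metadata and Documentation",
    "Storage, backup, replication, versioning",
    "Security",
    "Budget",
    "Privacy, Intellectual Property",
    "Archiving, Preservation, Long-term Access",
    "Adherence",
]


@dataclass
class DataSharingPlan:
    """Free-text answers keyed by section name."""
    answers: dict = field(default_factory=dict)

    def missing_sections(self):
        """Return the sections that have no non-empty answer yet."""
        return [
            name for name in PLAN_SECTIONS
            if not self.answers.get(name, "").strip()
        ]


# Invented sample answers for a partially drafted plan.
plan = DataSharingPlan(answers={
    "Data Description": "RNA-seq counts from an observational cohort.",
    "Formats": "CSV and FASTQ.",
    "Security": "   ",  # whitespace only, so still counted as missing
})
missing = plan.missing_sections()
print(f"{len(missing)} of {len(PLAN_SECTIONS)} sections still need answers")
```

A check like this only confirms that every section has some answer; whether the answers are adequate still needs review by the people responsible for the plan.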

Data Sharing Responsibilities

The context of a data sharing project can mandate certain safety requirements and restrictions. Considerations may include whether the data is shared before or after publication, and whether it is shared locally within the Fred Hutch Cancer Consortium or with a larger national or international audience.

While the underlying goal of sharing is to further research and its benefits, the underlying tenets of data sharing processes are to ensure:

  • Identification of who shall have access to the data

  • Data sharing does not compromise individual subjects’ rights and privacy, regardless of whether the data have been used in a publication

  • Data shared is restricted to only that data appropriate for a specific line of inquiry

  • The integrity and quality of shared data are preserved

  • Data sharing is done within the legal requirements of both sender and receiver

  • Shared data is easily readable or accessible by the receiver. Consider whether metadata should be provided along with the data to make it easily understood

  • Appropriate intellectual property, copyright, and licensing issues are considered prior to sharing
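One common way to support the integrity tenet above is to ship a checksum manifest with the data so the receiver can confirm that files arrived unchanged. The sketch below uses only the Python standard library; the function names (`sha256_of`, `build_manifest`, `changed_files`) and the file name `counts.csv` are hypothetical, and this is an illustration rather than a prescribed institutional procedure.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to directory) to its SHA-256 digest."""
    root = Path(directory)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*")) if p.is_file()
    }


def changed_files(directory, manifest):
    """Return manifest entries that are missing or whose digest differs."""
    current = build_manifest(directory)
    return sorted(name for name, digest in manifest.items()
                  if current.get(name) != digest)


# Demonstration on a throwaway directory with a hypothetical file name.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "counts.csv"
    data_file.write_text("sample,count\nA,10\n")
    sender_manifest = build_manifest(tmp)
    intact = changed_files(tmp, sender_manifest)   # nothing changed yet
    data_file.write_text("sample,count\nA,11\n")   # simulate a corrupted copy
    corrupted = changed_files(tmp, sender_manifest)

print("changed before edit:", intact)
print("changed after edit:", corrupted)
```

In practice the sender would write the manifest to a file that travels with the data, and the receiver would run the comparison after transfer.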

Before sharing data, it is prudent to determine if sharing is permitted. Below are some relevant questions to help make this determination:

  • If the data is derived from human subjects research, does the associated IRB-approved informed consent (or waiver of informed consent) permit disclosure for the purpose contemplated in the data use agreement (DUA)? If not, new IRB review, and a waiver of consent or re-consenting of subjects, may be required before sharing is permitted.

  • If the data was collected pursuant to a sponsored research project, has the sponsor placed restrictions on the subsequent transfer of the data? Stipulations may prevent sharing or require specific sharing restrictions.

  • If the data was initially received from, or derived from, data received from a third party pursuant to a contract, does that contract place restrictions on the subsequent transfer of the data? Stipulations may prevent sharing or require specific sharing restrictions.
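The three questions above can be treated as a simple screening step before any transfer is arranged. The sketch below is a hypothetical illustration: the `SharingContext` fields and the `items_to_resolve` function are invented names, and the output is a list of follow-up items, not a determination. The actual decision rests with the IRB, the sponsor agreement, and the relevant contracts.

```python
from dataclasses import dataclass


@dataclass
class SharingContext:
    """Hypothetical yes/no answers to the screening questions above."""
    human_subjects: bool
    consent_permits_disclosure: bool
    sponsor_restricts_transfer: bool
    contract_restricts_transfer: bool


def items_to_resolve(ctx):
    """List the issues that need follow-up before the data can be shared."""
    items = []
    if ctx.human_subjects and not ctx.consent_permits_disclosure:
        items.append("consent: new IRB review, and a waiver of consent or "
                     "re-consenting of subjects, may be required")
    if ctx.sponsor_restricts_transfer:
        items.append("sponsor: review stipulations on subsequent transfer")
    if ctx.contract_restricts_transfer:
        items.append("third-party contract: review stipulations on "
                     "subsequent transfer")
    return items


# Invented example: human subjects data received under a restrictive contract.
example = SharingContext(
    human_subjects=True,
    consent_permits_disclosure=False,
    sponsor_restricts_transfer=False,
    contract_restricts_transfer=True,
)
for item in items_to_resolve(example):
    print("-", item)
```

An empty list means only that none of these three questions raised a flag; it does not by itself establish that sharing is permitted.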

Required Data Sharing

HHS issued a final rule, Clinical Trials Registration and Results Information Submission, which was made publicly available on September 16, 2016. The NIH issued a complementary final policy under which NIH-funded awardees and investigators are required to submit registration and results information for all NIH-funded clinical trials, whether or not the trials are covered by the FDAAA requirements.

Funding Agencies

Many (but not all) federal funding agencies require data management plans and data sharing plans as part of a grant application. Anyone applying to a federal funding source should check that agency's data sharing plan requirements.

Some of the US federal granting agencies that require funded projects to provide some form of data management planning include NIH, NSF, DOE, DOD, HHS, and FDA.

Funding agencies may want to know:

  • What data are you producing that will be accessible to (shared with) others?

  • When and how will you make it discoverable? Will you place it in a repository, data commons, or other location so others can find it?

  • How will you make it accessible? Will there be restrictions on some or all of the data and, if so, how do people obtain access?

  • How will you make it usable/reusable? Will there be documentation, definitions, descriptions of methodology, use of standard terminology, etc.?

Data Sharing as Part of Publications

As of 1 July 2018, manuscripts reporting clinical trial data that are submitted to International Committee of Medical Journal Editors (ICMJE) journals must contain a data sharing statement, indicating:

  • whether the authors intend to share individual de-identified participant data

  • what specific data they intend to share

  • what other study-related documents will be made available

  • how the data will be accessible

  • when and for how long they will be made available.

The statement may then be taken into account by ICMJE editors when considering the paper for publication. Furthermore, clinical trials that begin enrolling participants on or after 1 January 2019 must include a data sharing plan in the trial’s registration if they wish to publish results in ICMJE journals. Any deviations from this plan must be disclosed in the data sharing statement when published.

The ICMJE believes that scientists have a moral obligation to share clinical data to maximize the knowledge obtained from these research efforts. The committee is working with a number of groups to solve the practical issues that it acknowledges still stand in the way of its goal of universal data sharing.
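The five elements of a data sharing statement listed above can be captured as a structured record so that none is forgotten when drafting a manuscript. This is a minimal sketch under stated assumptions: the `DataSharingStatement` class, its field names, and the rendered wording are hypothetical and are not an ICMJE-endorsed format, and the example values are invented placeholders.

```python
from dataclasses import dataclass


@dataclass
class DataSharingStatement:
    """One field per element the statement must indicate."""
    will_share_ipd: bool       # share individual de-identified participant data?
    data_to_share: str         # what specific data will be shared
    other_documents: str       # other study-related documents made available
    access_mechanism: str      # how the data will be accessible
    availability_window: str   # when and for how long

    def render(self):
        """Produce a plain-text draft with one line per element."""
        share = "will" if self.will_share_ipd else "will not"
        lines = [
            f"Individual de-identified participant data {share} be shared.",
            f"Data to be shared: {self.data_to_share}",
            f"Other documents available: {self.other_documents}",
            f"Access: {self.access_mechanism}",
            f"Availability: {self.availability_window}",
        ]
        return "\n".join(lines)


# Invented placeholder values, for illustration only.
statement = DataSharingStatement(
    will_share_ipd=True,
    data_to_share="de-identified data underlying the reported results",
    other_documents="study protocol and statistical analysis plan",
    access_mechanism="on request through a controlled-access repository",
    availability_window="from publication, for a defined period",
)
draft = statement.render()
print(draft)
```

Journals may have their own required wording or templates, so a draft like this is a starting point to check against the target journal's instructions.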

Genomic Data Sharing

The American College of Medical Genetics and Genomics Board of Directors suggests that broad data sharing is necessary and will improve care by making available the best data possible, by which:

  • Key clinical attributes of the phenotype of those with genetic diseases can be described;

  • The qualitative strength of the association between genetic diseases and the underlying causative genes can be established;

  • The classification of genomic variants across the range of benign to pathogenic can be established;

  • Differences in variant interpretation among laboratories can be reconciled;

  • The appropriate classification of variants of uncertain significance can be made; and

  • Standards used in variant classification can be improved.

As noted by the NIH and others, the nature of genomic data requires several specific considerations be kept in mind. In addition to being personal and unique to each individual, genomic data may, for example:

  • Be stored and used indefinitely.

  • Inform individuals about susceptibility to a broad range of conditions (some of which are unexpected given personal or family history).

  • Carry with them risks that are uncertain or unclear.

  • Be reinterpreted and change in relevance over time.

  • Raise privacy concerns (in part because of the risk of re-identification).

  • Be relevant for family members and reproductive decision-making.

  • Transform the personal decision of sharing genetic data into one with current and future familial ramifications.

There are many challenges to sharing phenotypic and genotypic data. As such, some genomic data sharing best practices are listed below:

  • Investigator(s) should use requested datasets solely in connection with the research project described in the approved data request, protocol or other vehicle that bounds the use and disclosure of the data.

  • Investigator(s) will make no attempt to identify or contact individual participants from whom these data were collected without appropriate approvals from the relevant IRBs.

  • Investigator(s) will use informed consent documents with specific language addressing the potential persistent harms that identified data can carry, the rights of patients to withdraw consent, the type of data involved, and the respect and trust owed to participants who give researchers the right to use their data and samples for “future unspecified research.”

  • Investigator(s) will not distribute these data to any entity or individual beyond those specified in the approved data request, protocol or other vehicle which bounds the use and disclosure of the data.

  • Investigator(s) will by default share de-identified data.

  • Investigator(s) will strive to harmonize data collection and archiving (storage) methods, tools, and representation to ensure validation of scientific quality.

  • Investigator(s) will adhere to computer security practices that ensure only authorized individuals can gain access to data files, minimize unintended access, and otherwise adhere to institutional security requirements.

NIH Genomic Data Sharing (GDS) Policy

The GDS Policy expects that large-scale genomic research data from NIH-supported studies involving human specimens will be submitted to an NIH-designated data repository. Non-human data may be submitted to any widely used repository, or to the repositories where investigators have previously submitted data of the same type. NIH provides examples of relevant databases on the NIH Office of Science Policy website.

  • Investigators should submit large-scale human genomic data as well as relevant associated data (e.g., phenotype and exposure data) to an NIH-designated data repository in a timely manner.

  • Investigators should also submit any information necessary to interpret the submitted genomic data, such as study protocols, data instruments, and survey tools.

  • Grant applications and protocols should include details of the required data sharing plan.

  • In general, consent documents should include language that allows for the broad future sharing of genomic data. The NIH recognizes that there will be instances where broad sharing may not be appropriate and the policy outlines the exceptions.

  • Under the GDS Policy, the release of human data for secondary research can generally be deferred for up to six months after data submission, with no publication embargo upon data release.

  • An IRB must review the foundational consents (all versions) to confirm that consent was appropriately sought, includes broad sharing (if appropriate), and to identify any limitations on the upload.

NOTE: If you are generating genomic data that does not meet the definition of “large-scale,” consider including a section in your Resource Sharing Plan that says, “Genomic Data Sharing: Not Applicable” with a brief explanation. Grant applications are not always sufficiently detailed for staff to determine whether the GDS policy applies, in which case staff must contact the PI, which can cause delays.

  • The NIH Genomic Data Sharing Policy is here.

  • Large-scale genomic data include genome-wide association studies (GWAS), single nucleotide polymorphism (SNP) arrays, and genome sequence, transcriptomic, metagenomic, epigenomic, and gene expression data. Examples of research subject to the GDS Policy include, but are not limited to, projects that involve generating whole genome sequence data for more than one gene from more than 1,000 individuals, analyzing 300,000 or more genetic variants in more than 1,000 individuals, or sequencing more than 100 isolates of infectious organisms such as bacteria. The Supplemental Information to the NIH Genomic Data Sharing Policy includes a detailed description of research within the scope of the policy and of data submission expectations.
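The example thresholds above lend themselves to a quick self-check when drafting a Resource Sharing Plan. The sketch below encodes only the three examples quoted in this section; the function name and parameters are hypothetical, and because the policy's examples are explicitly "not limited to" these cases, an empty result does not mean the GDS Policy does not apply. The Supplemental Information and NIH program staff remain the authority.

```python
def gds_scope_examples_matched(n_individuals=0, n_genes_sequenced=0,
                               n_variants_analyzed=0, n_isolates_sequenced=0):
    """Return which of the three example thresholds a project meets.

    The thresholds mirror the examples quoted in the text above; they are
    illustrative and do not define the full scope of the policy.
    """
    matched = []
    if n_genes_sequenced > 1 and n_individuals > 1000:
        matched.append("sequence data for more than one gene "
                       "from more than 1,000 individuals")
    if n_variants_analyzed >= 300_000 and n_individuals > 1000:
        matched.append("300,000 or more genetic variants "
                       "in more than 1,000 individuals")
    if n_isolates_sequenced > 100:
        matched.append("more than 100 isolates of infectious organisms")
    return matched


# Invented project sizes for illustration.
gwas = gds_scope_examples_matched(n_individuals=2500,
                                  n_variants_analyzed=500_000)
pilot = gds_scope_examples_matched(n_individuals=40, n_genes_sequenced=3)
print("GWAS matches:", gwas)
print("Pilot matches:", pilot)
```

A project that matches none of the examples, like the hypothetical pilot here, may be a candidate for the "Genomic Data Sharing: Not Applicable" note described above, subject to confirmation.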

NIH Public Genomic Data Repositories

NIH is committed to respecting the privacy and intentions of research participants with regard to how their individual data is used. Data access is therefore intended only for scientific investigators pursuing research questions that are consistent with the informed consent agreements provided by individual research participants. Furthermore, investigators who are given access will be expected to use appropriate controls and abide by the applicable Data Use Limitations.

dbGaP has two repositories into which researchers can deposit, and from which they can retrieve, de-identified data:

  • Public Access: Non-individual genomic data can be publicly accessed through the dbGaP website.

  • Controlled Access (Individual-level data): Individual-level data submitted to dbGaP must be de-identified, so no names or other identifying information are attached to the data. However, the genetic fingerprint embedded in an individual’s genotype data cannot be de-identified. To protect individuals’ privacy, all individual-level data is therefore distributed only through the NIH Authorized Access System.

Additional information regarding the process of depositing data into NIH Public Repositories can be found from NCBI and from the NCI.

Some Public Repositories

Here we provide some details on a few public repositories; the list is by no means exhaustive. If you know of another repository that you would like to see added to this page, please contact us. Some of these repositories require IRB approval to download the data. Please correspond with the proper administrative authority.

  • GTEx: The Genotype-Tissue Expression project aims to understand how gene expression and its regulation differ across 53 different tissues.
  • TCGA: The Cancer Genome Atlas is a joint effort by the NCI and the NHGRI to molecularly characterize approximately 20,000 primary cancer and matched normal samples spanning 33 cancer types.
  • 1000 Genomes: The 1000 Genomes Project sequenced 1000 genomes to find genetic variants with a frequency of at least 1% in the human population.
  • GEO: Gene Expression Omnibus is a public genomics data repository.
  • gnomAD: The Genome Aggregation Database is working to aggregate and harmonize sequencing data across multiple studies.
  • TOPMed: Trans-Omics for Precision Medicine (TOPMed), brought to you by NIH and NHLBI, aims to understand “the fundamental biological processes that underlie heart, lung, blood, and sleep (HLBS) disorders.”
  • dbSNP: Provides information about single nucleotide polymorphisms (SNPs) and tracks the publication history of these SNPs.
  • UK Biobank: Prospective cohort study with a wealth of genetic and health data on 500,000 participants.
  • cBioPortal: This is an excellent resource for viewing/accessing TCGA cancer genomics data.
  • Sage’s platform is useful not only for sharing your own research data (privately or publicly), but also for accessing the several open datasets hosted there and for interacting with the various DREAM Challenges they host.
