More to come here. We will develop and share guidance and information about useful tools, approaches, software, and strategies for sharing data, whether sharing is required by a funding source or a publication, or takes place during the course of your research with collaborators.
Data sharing in genomics and other large-scale datasets has highlighted specific new challenges and possibilities. Sharing large-scale research data has the potential to strengthen academic medical research, the practice of medicine, and the integrity of the clinical trial system. Some benefits are obvious: when researchers have access to complete data, they can answer new questions, explore different lines of analysis, and conduct large-scale analyses across trials or projects more efficiently. However, our evolving collective understanding of data sharing practices for large-scale datasets can result in an unnecessary burden on researchers and their research, one that is counterproductive and may not make patients or researchers any safer. This section can help guide decisions and actions for sharing and managing research data in a way that supports the productivity of the original research while balancing the data privacy and security needs of those involved.
Here we provide some details on a few public repositories; the list is by no means exhaustive. If you know of another repository that you would like to see added to this page, please contact us. Some of these repositories require IRB approval to download the data. Please contact the appropriate administrative authority.
As noted by the NIH and others, the nature of genomic data requires several specific considerations be kept in mind. In addition to being personal and unique to each individual, genomic data may, for example:
Be stored and used indefinitely.
Inform individuals about susceptibility to a broad range of conditions (some of which are unexpected given personal or family history).
Carry with them risks that are uncertain or unclear.
Be reinterpreted and change in relevance over time.
Raise privacy concerns (in part because of the risk of re-identification).
Be relevant for family members and reproductive decision-making.
There are many challenges to sharing phenotypic and genotypic data. To help address them, some genomic data sharing best practices are listed below:
Investigator(s) should use requested datasets solely in connection with the research project described in the approved data request, protocol or other vehicle that bounds the use and disclosure of the data.
Investigator(s) will make no attempt to identify or contact individual participants from whom these data were collected without appropriate approvals from the relevant IRBs.
Investigator(s) will use informed consent documents with specific language addressing the potential persistent harms that identified data can carry, the rights of patients to withdraw consent, the type of data involved, and the respect and trust due to participants who give researchers the right to use their data and samples for “future unspecified research.”
Investigator(s) will not distribute these data to any entity or individual beyond those specified in the approved data request, protocol or other vehicle which bounds the use and disclosure of the data.
Investigator(s) will by default share de-identified data.
Investigator(s) will strive to harmonize data collection and archiving (storage) methods, tools, and representation to support validation of scientific quality.
Investigator(s) will adhere to computer security practices that ensure only authorized individuals can gain access to data files, minimize unintended access, and otherwise comply with institutional security requirements.
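As an illustration of the practice of sharing de-identified data by default, the following is a minimal Python sketch, not a prescribed method. It assumes a flat record whose direct identifiers can be dropped and whose participant ID can be replaced with a keyed hash. The field names (`name`, `email`, `mrn`, `participant_id`, `genotype_call`), the helper names, and the key handling are all hypothetical. Removing these fields does not by itself satisfy any particular regulatory or institutional standard, and genotype data remain inherently identifying.

```python
import hashlib
import hmac

# Hypothetical direct-identifier fields; adapt to your own data dictionary.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "address", "mrn"}


def pseudonymize_id(participant_id: str, secret_key: bytes) -> str:
    """Return a keyed hash (HMAC-SHA256, hex) of the participant ID."""
    return hmac.new(
        secret_key, participant_id.encode("utf-8"), hashlib.sha256
    ).hexdigest()


def deidentify_record(
    record: dict, secret_key: bytes, id_field: str = "participant_id"
) -> dict:
    """Drop direct identifiers and replace the participant ID with a pseudonym.

    The input record is not modified; a new dict is returned.
    """
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned[id_field] = pseudonymize_id(str(record[id_field]), secret_key)
    return cleaned


# Illustrative record with made-up values.
example = {
    "participant_id": "P-0001",
    "name": "Jane Doe",
    "email": "jane@example.org",
    "genotype_call": "A/G",
}

# The key must be stored separately from the shared dataset (assumption).
key = b"replace-with-a-secret-kept-outside-the-dataset"

shared = deidentify_record(example, key)
print(shared)
```

Using a keyed hash rather than a plain hash means that someone without the key cannot regenerate pseudonyms from a list of known participant IDs, while the data holder can still link records for the same participant across files.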
NIH is committed to respecting the privacy and intentions of research participants with regard to how data pertaining to them are used. Data access is therefore intended only for scientific investigators pursuing research questions that are consistent with the informed consent agreements provided by individual research participants. Furthermore, investigators who are granted access will be expected to use appropriate controls and abide by the applicable Data Use Limitations.
dbGaP (the database of Genotypes and Phenotypes), for example, has two repositories into which researchers can deposit, and from which they can retrieve, de-identified data:
Public Access: Non-individual-level genomic data can be publicly accessed through the dbGaP website.
Controlled Access (Individual-level data): Individual-level data submitted to dbGaP must be de-identified: no names or other identifying information are attached to the data. However, an individual’s genotype data carry an embedded genetic fingerprint that cannot be de-identified. For that reason, to protect individuals’ privacy, all individual-level data are distributed only through the NIH Authorized Access System.
Additional information about the process of depositing data into NIH public repositories is available from NCBI and from the NCI.