De-identification generally refers to the removal of the 18 identifiers listed in HIPAA regulation 45 CFR 164.514(b). However, de-identification also requires that, once these identifiers are removed, the risk of re-identification is very small, including re-identification by methods that use publicly available data. Even without the 18 identifiers, individual-level genomics data could potentially identify an individual. Therefore, de-identification of genomics data also relies heavily on additional privacy and security measures, such as adherence to strong data use limitations and practices and a strict security policy. This section describes more specific approaches to de-identifying specimens and datasets for translational genomics studies.

What is de-identification?

De-identification refers to the removal or dissociation of direct patient identifiers from a research specimen or data set in order to inhibit the ability to deduce an individual's identity. Ideally, after de-identification it would not be possible to use any remaining information, alone or in combination with other readily available information, to identify the subject from whom the data originated. Furthermore, Human Subjects Protection dictates that the identity of source subjects cannot be readily ascertained or otherwise associated with the data by the research staff or secondary data users (45 CFR 46.102(f)). The goal of de-identification is to reduce, to the greatest extent possible, the risk of identifying the individuals from whom specimens are obtained or from whom associated genomic datasets are generated. In genomics, this includes the case in which genomic datasets generated from a human specimen may be made publicly available for data sharing and further research. The most common method is the Safe Harbor method: removal of the 18 identifiers listed in HIPAA regulation (45 CFR 164.514(b)(2)). A Fred Hutch guide on HIPAA and Research is on the HDC Compliance and Security Centernet Page.

Ethnicity, gender, age, marital status, geographical location, and preferred language are descriptors (indirect identifiers) that, when combined, can enable a patient to be re-identified. This raises particular concerns about privacy, stigmatization, and discrimination, since it diminishes the ability to protect the confidentiality of the individuals or groups participating in the research. For example, members of an identifiable population may be stigmatized or discriminated against if research reveals that the group is at high risk of having a genetic variant associated with a particular disease. For some communities, close family relationships may also make it especially challenging to protect participants' privacy, even if research samples are de-identified. To ensure confidentiality, removing direct identifiers is not enough: indirect identifiers such as date of birth, location, marital status, preferred language, and ethnicity should also be reviewed and removed when possible.
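One way to review indirect identifiers before release is to count how many records share each combination of them; a combination held by a single record is the kind that could single out a patient. The sketch below is an illustration only, not a Fred Hutch procedure. It uses a simple k-anonymity-style count, a technique not named in this guidance, and the field names (including the hypothetical "zip3" field), the threshold k, and the example rows are all assumptions.

```python
from collections import Counter

# Hypothetical column names; adapt to the fields present in the actual dataset.
QUASI_IDENTIFIERS = ("ethnicity", "gender", "age", "zip3")


def group_sizes(records, quasi_identifiers=QUASI_IDENTIFIERS):
    """Count how many records share each combination of indirect identifiers."""
    return Counter(tuple(rec.get(q) for q in quasi_identifiers) for rec in records)


def records_below_k(records, k=2, quasi_identifiers=QUASI_IDENTIFIERS):
    """Return records whose combination of indirect identifiers is shared by fewer than k records."""
    sizes = group_sizes(records, quasi_identifiers)
    return [
        rec for rec in records
        if sizes[tuple(rec.get(q) for q in quasi_identifiers)] < k
    ]


# Made-up example rows, for illustration only.
example = [
    {"ethnicity": "A", "gender": "F", "age": 34, "zip3": "981"},
    {"ethnicity": "A", "gender": "F", "age": 34, "zip3": "981"},
    {"ethnicity": "B", "gender": "M", "age": 71, "zip3": "980"},
]

unique_rows = records_below_k(example, k=2)
print(len(unique_rows))  # 1: the third row is unique on these four fields
```

Records flagged this way are candidates for removing or coarsening one or more indirect identifiers. A count like this does not by itself establish that a dataset is de-identified, particularly when genomic data is attached.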

Some types of individual-level genomic data can be used to identify an individual even without the 18 identifiers. Thus, de-identification of genomics data also relies heavily on additional confidentiality and security measures that are specific to the data type involved, such as adherence to strong data use limitations and practices. Genetic data is classified as Strictly Confidential under the FH Data Classification and Handling Standard. The term is generally taken to mean the sequence of a person's genome, though it remains ambiguous how that applies to different types of genomic datasets.

How can a specimen be de-identified?

Under HIPAA, specimens/private information can be de-identified by replacing direct (and indirect) identifiers with a masking/coding schema. The masking schema should use codes that neither include any component of the identifiable patient data nor have a direct relationship to it; generating the coded identifiers at random ensures this. For data that may need to be re-identified later, it is critical to retain appropriate documentation of the mapping between the identifiable data from patients, specimens, or datasets and the coded identifiers generated. Depending on the study design and protocol, the entity retaining this mapping documentation will be responsible for preventing non-approved re-identification. In some cases, if the research group maintains this mapping documentation, their research will still be considered human subjects research, whereas if the mapping is maintained by a non-involved third party, the research may be considered exempt.
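The coding step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed implementation: the function names, the "SPC-" prefix, the code length, and the "patient_id" field are all hypothetical. Codes are drawn from Python's standard secrets module, so they carry no component of the original identifier.

```python
import secrets


def build_code_map(patient_ids, prefix="SPC-", n_bytes=8):
    """Assign each distinct identifier a random code that is not derived from it."""
    code_map = {}
    used = set()
    for pid in patient_ids:
        if pid in code_map:
            continue  # the same patient always maps to the same code
        code = prefix + secrets.token_hex(n_bytes)
        while code in used:  # regenerate on the (very unlikely) collision
            code = prefix + secrets.token_hex(n_bytes)
        used.add(code)
        code_map[pid] = code
    return code_map


def apply_codes(records, code_map, id_field="patient_id"):
    """Return copies of the records with the identifier replaced by its code."""
    coded = []
    for rec in records:
        out = {k: v for k, v in rec.items() if k != id_field}
        out["coded_id"] = code_map[rec[id_field]]
        coded.append(out)
    return coded


# Made-up identifiers, for illustration only.
mapping = build_code_map(["MRN-1001", "MRN-1002"])
released = apply_codes([{"patient_id": "MRN-1001", "tissue": "blood"}], mapping)
```

Here the mapping is the documentation that would need to be retained, and access-controlled, by whichever entity is responsible for preventing non-approved re-identification; only the coded records would be shared. This sketch replaces a single direct identifier and does not address indirect identifiers or the genomic data itself.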

A common question is whether the HIPAA de-identification process is enough to render specimens/private information sufficiently non-identifiable for exemption from human subjects research. A specimen de-identified under HIPAA is not the same as a non-identifiable specimen for research (Office for Human Research Protections). For research purposes, a specimen rendered non-identifiable under 45 CFR 46.102(f) could qualify as exempt from human subjects research. To be considered non-identifiable (and thus not human subjects research), a specimen must meet the following two requirements:

  1. The private information or specimens are NOT collected for a specific research project through an interaction or intervention with living individuals; and

  2. The investigator(s) cannot readily ascertain the identity of the individual(s) to whom the coded private information or specimens pertain because, for example:

How can a genomic dataset be de-identified?

Even if all direct and indirect identifiers are removed from a genomic data set, the genomic data itself may still be able to identify a patient, depending on the data type and the degree of processing of the specific data entity. At the same time, the need to account for race, ethnicity, and gender in scientific research can be a reason to retain some of these indirect identifiers in research datasets. Thus, confidentiality is additionally supported through strong use limitations, data use agreements, and appropriate data security.