“The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.” - John Tukey
“If you torture the data, it will confess to anything” - Ronald Coase.
Upfront consideration to the design and execution of large scale data generating research projects can help to prevent unfortunate or costly outcomes. This section provides guidance on and resources related to study design, human subjects research, management of clinical and experimental data, and a review of factors to consider when choosing from some of the common large scale molecular data generating platforms.
This section provides guidance for researchers looking to develop a hypothesis that will have reasonable statistical power, identify the appropriate set of samples, and execute a large scale data production from those samples. There are the two general types of studies using large scale molecular data sets, which can loosely be thought of as “investigative” and “comparative.” The two aren’t completely extricable and can each can serve as groundwork for future experiments for the other. The process to perform these types of studies, however, can be very different. The details specific to each study type are best addressed prior to generating materials or data sets.
The policies and processes that relate to the human subject components of any large scale data generating or analyzing study are continually evolving as new issues arise and become more clear. Keeping up with the particular issues that do or do not apply to a given research project can sometimes be a challenge, and these pages contain relevant guidance and links to the necessary information researchers need before, during and after a research project involving human specimens or data.
For a each study, the particular covariates associated with large scale data sets typically come from clinical or laboratory data. When these data are originating from human samples, certain protections need to be in place to ensure patient privacy. There are resources at the Fred Hutch which can help researchers effectively manage these data so that they can be associated with downstream molecular data sets more consistently and securely.
The decisions required when generating large scale data sets themselves are informed by an understanding of the specimen cohort, any limitations imposed by the consent of the patients from which those specimens were obtained, and the specific hypothesis the researcher is intending to address. This section contains guidance about generating or handling large scale data from a variety of sources, highlights the particularites of each, and include information for researchers interacting with various Fred Hutch Shared Resources.
Some of the challenges associated with executing large scale data projects can be mitigated by implementing good laboratory management and data systems before beginning the project. These sections relate to the particular systems available to Fred Hutch researchers that help them manage data generated in laboratories and also specific types of guidance about good laboratory practices with large scale data generation in mind.