Overview of Data Management for Researchers

While the primary focus of research is the scientific endeavor itself, the importance of knowing best practices for data-oriented issues is often understated. The Generation section includes important guidance on data privacy and security when data relate to humans or human specimens, and the Computing section of this site contains a wide array of detailed information about the resources and processes provided by Center IT and Scientific Computing specifically. Here, we aim to summarize the essential points of best practice at the Fred Hutch for the following topics.

Data Storage Best Practices

Depending on the type of data used by a research group, different combinations of data storage options may be appropriate. Assessing the strengths, mode of access, and interactivity with computing resources of each storage option, alongside an assessment of the types of data a research group uses and how the group interacts with those data, is increasingly important for researchers. This assessment is also worth repeating over time, as data storage and connectivity infrastructure change and as a research group's particular data types and interactions evolve.

Data Ingestion and Public Datasets Best Practices

Large-scale datasets for a study can come from multiple sources, such as an outside sequencing center, the Fred Hutch Genomics Shared Resource, or other large databases. Additionally, a study might rely on publicly available datasets in a repository with some degree of managed access. Regardless of the source, if bioinformatic or analytic processing of the data is required, the data will need to be accessible via Fred Hutch managed data storage, a compute resource, or a workstation. Below, we outline some of the approaches for ingesting and storing only the most relevant portions of the data. Being selective about which aspects of public datasets need to be copied to local storage, and to what degree, helps ensure that data storage and cost issues do not arise unnecessarily. Knowing the various repositories and modes of access that exist for data of interest ensures that research progress is hindered neither by data access challenges nor by costs incurred from unnecessary data storage.
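As a minimal sketch of what selective ingestion can look like, the Python snippet below downloads only a named subset of files from a public repository rather than mirroring the whole dataset. The base URL, file names, and destination path are hypothetical placeholders, not a real repository or Fred Hutch mount point.

```python
import urllib.request
from pathlib import Path

# Hypothetical public repository and the subset of files actually needed
# for the analysis -- substitute the real dataset and file list.
BASE_URL = "https://example.org/public-dataset"
FILES_OF_INTEREST = ["sample01.counts.tsv", "sample02.counts.tsv", "metadata.tsv"]

# Hypothetical destination on managed storage.
dest = Path("./ingested-data")
dest.mkdir(parents=True, exist_ok=True)

for name in FILES_OF_INTEREST:
    target = dest / name
    if target.exists():
        # Skip files already ingested so re-runs do not re-download.
        print(f"already present, skipping: {name}")
        continue
    url = f"{BASE_URL}/{name}"
    print(f"downloading {url}")
    urllib.request.urlretrieve(url, target)
```

Keeping the file list explicit, rather than fetching everything, is what keeps local storage (and its costs) limited to the portions of the dataset the analysis actually uses.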

Using Scratch

Scratch storage space is temporary storage that can be leveraged during workflows to store intermediate files or working copies of raw data while an analysis is being performed. The benefits and limitations of using Scratch storage are discussed here, along with some guidance on how to structure your workflows to best employ Scratch.
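As one hedged illustration of structuring a workflow around Scratch, the sketch below stages a working copy and intermediate files in a scratch directory and copies only the final result to permanent storage. The scratch and permanent paths here are placeholders, not the actual Fred Hutch mount points.

```python
import shutil
from pathlib import Path

# Placeholder paths -- substitute your group's actual scratch and
# permanent storage locations.
SCRATCH = Path("/tmp/scratch-demo")   # temporary working space
PERMANENT = Path("./results")         # long-term project storage

SCRATCH.mkdir(parents=True, exist_ok=True)
PERMANENT.mkdir(parents=True, exist_ok=True)

# Stage a working copy of the raw input and write intermediates to scratch,
# so large temporary files never land in permanent storage.
raw_copy = SCRATCH / "raw_input.txt"
raw_copy.write_text("example raw data\n")

intermediate = SCRATCH / "intermediate.txt"
intermediate.write_text(raw_copy.read_text().upper())

# Keep only the final output; scratch contents are treated as disposable.
final = SCRATCH / "final_result.txt"
final.write_text(intermediate.read_text().strip() + " -- processed\n")
shutil.copy2(final, PERMANENT / final.name)

# Clean up scratch once the result is safely stored.
shutil.rmtree(SCRATCH)
print(f"final result stored at {PERMANENT / final.name}")
```

The key design point is that nothing in scratch is treated as a copy of record: the workflow can always be re-run from the raw data and the saved final output.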

Data Archiving Best Practices

Typically the need to actively archive data is fairly rare, but as research datasets and digital research assets grow ever larger, such as imaging or genomics data, a strategy for managing data archiving becomes important. Data management plans and public repository deposition requirements also increasingly require researchers to steward large datasets in a more active manner. Prioritizing active data management during the research process can therefore avoid a massive loss of time and resources spent slogging through old data to comply with current funding or publication requirements.
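As a minimal sketch of one archiving approach (not an official Fred Hutch procedure), the snippet below bundles a completed project directory into a compressed tarball and records per-file checksums so the archive's contents can be verified later. The directory and file names are hypothetical.

```python
import hashlib
import tarfile
from pathlib import Path

project = Path("./completed-project")           # hypothetical directory to archive
archive = Path("./completed-project.tar.gz")
manifest = Path("./completed-project.md5")

# Record an MD5 checksum for every file so the archive's integrity can be
# verified before the working copy is deleted.
with manifest.open("w") as out:
    for path in sorted(project.rglob("*")):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            out.write(f"{digest}  {path}\n")

# Bundle the whole project into a single compressed tarball for archiving.
with tarfile.open(archive, "w:gz") as tar:
    tar.add(project, arcname=project.name)

print(f"wrote {archive} and checksum manifest {manifest}")
```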
