Data Storage Guidance


Most Fred Hutch researchers working with large-scale biomedical data sets initially store those data in Fast storage alongside their smaller-scale laboratory data. This provides direct, rapid access to files both manually (e.g., by mapping a drive to a workstation) and from local computing resources (via our HPC cluster, see below). However, it is important to plan where, when, and for how long to store data of each size, so that data access by researchers or compute resources, data transfer, and archiving do not become unnecessarily complicated and hinder the research process.

Depending on the types of data a research group uses, different combinations of data storage options may be appropriate. It is becoming more important for researchers to assess the strengths, mode of access, and interactivity with computing resources of each storage option alongside the types of data the group uses and how the group interacts with those data. It is also becoming more important to repeat this assessment over time, as data storage and connectivity infrastructure changes and as the data types and interactions of research groups evolve.

Overview of Storage Resources

More detailed documentation regarding data storage at Fred Hutch, including additional information about data storage in databases, can be found in the Computing domain.

| Storage Resource | Costs (per TB/month)* | Backup Location/Duration | Best Use |
|---|---|---|---|
| Home | Free to 100GB limit | 7 days of Snapshots, Daily backups, Off Site copy | Data specific to a user, not shared to others, relatively small data sets |
| Fast | $$$ beyond 5TB per PI | 7 days of Snapshots, Daily backups, Off Site copy | Large data sets that need high performance access to computing resources, Unix file permissioning, but neither PHI nor temporary data (such as intermediate files) |
| Secure | beyond 1TB per PI | 7 days of Snapshots, Daily backups, Off Site copy | PHI containing datasets or those that require auditing, relatively small datasets |
| Economy Local and Cloud | $ beyond 5TB per PI | Multi-datacenter replication, 60 day undelete with request to helpdesk | Best for archiving large data sets, or primary storage of large files. Good for PHI or other data that requires encryption and auditing. Requires Desktop Client to access, see Object Storage page. |
| Scratch | Free | Not applicable | Temporary files, such as those intermediate to a final result that only need to persist during a job. Appropriate use can significantly reduce data storage costs, see Scratch Storage and Using Scratch pages. |

Note: All requests for admin assistance with data storage can be initiated by emailing helpdesk, but different system administrators are the primary contacts for different platforms. Admin assistance can be requested for data transfers and validation, data import, and restoring data from backups for a given resource, if available.

  • Contact Scientific Computing staff by emailing scicomp for help with:
    • identifying current storage space usage
    • assistance with identification of possible duplicated data sets
    • guidance with implementing active data management practices
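
As a starting point before contacting Scientific Computing, possible duplicated files can be listed with a short script. The following is a minimal sketch, not a Fred Hutch tool: the function names `sha256_of` and `find_duplicates` are hypothetical, and it assumes the storage in question is mounted as an ordinary file system path. It groups files by size first, then hashes only the files whose sizes collide.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root whose file contents are identical."""
    # Pass 1: group regular files by size (cheap; no file contents are read).
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only files that share a size with at least one other file.
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            by_hash[(size, sha256_of(path))].append(path)

    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

For example, `find_duplicates("/fh/fast/lastname_f")` (a placeholder path) would return one list of paths per set of identical files. Hashing large data sets reads every candidate file in full, so it may be worth running such a scan on a single project directory rather than an entire storage allocation.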

Data Locations for Fred Hutch Shared Resource-Generated Data

For data generated by Fred Hutch researchers via the Genomics Shared Resource, the default data deposition is currently managed directly by Genomics, and the data are made available to researchers via their Fast storage (e.g., at path /fh/fast/lastname_f/SR/ngs for sequencing data). Other types of datasets are transferred to researchers either in a dnaarray directory or via other forms of transfer specific to the platform type or data source. This allows for rapid access to recently generated datasets. However, once data generated via the Genomics Core are mainly being kept as an archive for occasional use, it is a good idea to visit the Data Storage section and consider implementing the active data management practices mentioned above with the assistance of Scientific Computing.

For example, depending on the intended use of the datasets, it may be desirable, once data is generated by the Genomics Shared Resource, to archive the data to the researcher’s Economy Local storage space and to put a copy in Scratch or Economy Cloud for immediate processing. The specific organization of archive and working copies of data will depend on the particular project involved.
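
The archive-plus-working-copy pattern can be sketched as a verified copy into two locations. This is an illustration only: the function names `file_digest`, `copy_verified`, and `archive_and_stage` are hypothetical, and the sketch assumes both destinations are reachable as ordinary file system directories. That assumption may not hold for Economy storage, which the table above notes requires a Desktop Client to access.

```python
import hashlib
import os
import shutil


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(source, dest_dir):
    """Copy source into dest_dir and confirm the copy matches by checksum."""
    os.makedirs(dest_dir, exist_ok=True)
    target = os.path.join(dest_dir, os.path.basename(source))
    shutil.copy2(source, target)  # copies contents and file metadata
    if file_digest(source) != file_digest(target):
        raise IOError(f"checksum mismatch after copying {source} to {target}")
    return target


def archive_and_stage(source, archive_dir, working_dir):
    """Place one verified copy in the archive and one in the working area."""
    archived = copy_verified(source, archive_dir)
    staged = copy_verified(source, working_dir)
    return archived, staged
```

The source file is left in place, so removing it from Fast storage after a successful copy remains a separate, deliberate step. For large transfers, requesting admin assistance with data transfer and validation through helpdesk, as noted above, may be the more appropriate route.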

  • For consultation on how to handle large amounts of externally or internally generated data, email scicomp.
  • For additional assistance regarding data generated via the Fred Hutch Genomics Shared Resource, email bioinformatics.

