Task Optimized Storage
Updated: May 12, 2023
Task storage is storage for data used in ad-hoc applications when the primary or principal storage location for that data is unsuitable or undesirable for that application.
It is critical that you do not store the primary or only copy of a data set in locations designated as task or scratch storage. These storage paths are not backed up and not designed for long-term storage of data.
Bioinformatics workflows often use pipelines with many execution steps. Each of these steps can create a large amount of temporary data. This data often needs to be kept for only a short period of time, for example to allow for quality assurance. Informaticians often do not delete this data afterwards because they are already off to the next task.
For example: you have the primary copy of sequencing data in cloud storage (e.g. AWS S3). This location is not usable for analysis on the gizmo compute cluster, so you copy the data to a location where it is usable by gizmo analysis tools (e.g. a network-attached storage system like Scratch).
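As a rough illustration of this stage-in pattern, the Python sketch below copies an object from S3 into a scratch directory on the cluster. The bucket name, object key, and scratch path are placeholders, and it assumes the boto3 library and valid AWS credentials are available on the node.

```python
import boto3
from pathlib import Path

# Placeholder locations: substitute your own bucket, object key, and scratch path.
BUCKET = "example-lab-sequencing"                    # durable primary copy lives here
KEY = "runs/sample01/reads.fastq.gz"                 # object to analyze
SCRATCH = Path("/fh/scratch/example_lab/sample01")   # task storage visible to gizmo

SCRATCH.mkdir(parents=True, exist_ok=True)
local_copy = SCRATCH / Path(KEY).name

# Stage the data into task storage; the S3 object remains the primary copy.
s3 = boto3.client("s3")
s3.download_file(BUCKET, KEY, str(local_copy))
print(f"staged s3://{BUCKET}/{KEY} to {local_copy}")
```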
If temporary data is created in a standard file system such as Fast storage, it will be captured in the file system snapshots and can be picked up by the backup system and copied offsite. If data is frequently created and deleted, these snapshots and backups can grow to 5 to 10 times the size of the original data.
To prevent this, everyone working with large datasets should incorporate task storage into their workflow.
Recommendations
Use of task storage requires careful thought and consideration. Given the volatile nature of the storage, the responsibility falls to you, the data steward, to ensure that you have accounted for the risks you take on as part of using these storage systems.
Some general guidelines for using task storage:
DO NOT keep the primary copy of data in a task storage location. Before storing your data set in a task storage location, ensure that the primary data set is in durable storage (such as Fast or S3) or that it is easily retrievable from its source (e.g. your collaborators, dbGaP, EMBL).
DO make sure that data generated into a task storage path is either a) easily regenerated or b) copied to durable storage immediately after generation (a minimal sketch of this stage-out step follows this list). Some intermediate data may not need to be copied if regenerating it isn't too difficult or computationally costly; this is a judgement you will have to make in determining whether intermediate data requires durable storage.
DO consider using a workflow manager to manage data in task storage. Workflow managers can be set up with targets to ensure data is present in task storage before a task begins, as well as to stage generated data out to durable storage after the task completes successfully.
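The following is a minimal Python sketch of the stage-out step mentioned above. The Temp and Fast paths and the file pattern are placeholders; which files are worth keeping is the judgement call described in the guidelines.

```python
import shutil
from pathlib import Path

# Placeholder paths: a task-storage working area and a durable destination.
task_dir = Path("/hpc/temp/example_lab/run42")     # task storage (purged, not backed up)
durable_dir = Path("/fh/fast/example_lab/run42")   # durable storage

durable_dir.mkdir(parents=True, exist_ok=True)

# Stage out only the results worth keeping; large intermediates that are easy
# to regenerate can stay behind and expire with the task storage.
for result in task_dir.glob("*.final.vcf.gz"):
    shutil.copy2(result, durable_dir / result.name)
    print(f"copied {result.name} to durable storage")
```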
The Data Science Lab and Scientific Computing teams are available to help you identify and mitigate these risks; please reach out to us with any questions or for help with your plans.
Storage Options for Tasks
Working
The Working file system is available on all rhino/gizmo compute nodes. It is a high-performance storage system mounted on rhino/gizmo nodes at the path /fh/working. This file system is not backed up, though snapshots are available. A quota is applied at the top level of the PI directory; the default is 20TB and can be increased to 50TB, though quota increases will incur costs.
feature | availability |
---|---|
backups | no |
snapshots | yes |
offsite replication/dr | no |
charges | over 20tb |
purged when unused | no |
Temp
The Temp file system is available on all rhino/gizmo compute nodes and can be mounted on managed OSX and Windows workstations. It is a high-performance storage system mounted on rhino/gizmo nodes at the path /hpc/temp. This file system is not backed up, though snapshots are available: snapshots are taken every half-hour and two are retained, meaning you can go back to a copy from up to an hour ago.
Files on Temp are deleted 30 days after creation; this purge is not tied to any of the traditional Unix attributes (atime, ctime, etc.).
feature | availability |
---|---|
backups | no |
snapshots | yes |
offsite replication/dr | no |
charges | none |
purged when unused | yes |
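Given the 30-day purge policy described above, it can help to audit how old your files on Temp are so you can stage results out before they disappear. The Python sketch below assumes a placeholder directory and uses modification time as a rough proxy, since the purge clock is based on creation time, which is not portably readable from Python on Linux.

```python
import time
from pathlib import Path

# Placeholder path; files here are purged 30 days after creation.
temp_dir = Path("/hpc/temp/example_lab/run42")
warn_after_days = 25  # flag files approaching the 30-day purge window

now = time.time()
for f in temp_dir.rglob("*"):
    if not f.is_file():
        continue
    # Modification time is used as a rough stand-in for creation time.
    age_days = (now - f.stat().st_mtime) / 86400
    if age_days >= warn_after_days:
        print(f"{f} is ~{age_days:.0f} days old; copy it to durable storage before it is purged")
```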
Scratch
The networked Scratch file system is available on all rhino/gizmo compute nodes and as part of the ‘center’ file server (i.e. //center.fhcrc.org/fh/scratch).
This storage space is based on a high-performance file system (BeeGFS) and is designed for high-throughput computing on large data sets in the rhino/gizmo environment. This file system does not have snapshots and is not backed up.
feature | availability |
---|---|
backups | no |
snapshots | no |
offsite replication/dr | no |
charges | none |
purged when unused | yes |
Job Local Storage
When you submit a job to gizmo and are allocated a node, you will also have task storage provisioned on local disk for the duration of the job. This storage is local to the node (i.e. a directly-attached disk) and is removed when the job completes, whether it failed or succeeded.
More documentation on using this storage is available here.
feature | availability |
---|---|
backups | no |
snapshots | no |
offsite replication/dr | no |
charges | none |
purged when unused | yes |
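A rough sketch of using job-local storage from inside a job script is shown below. It assumes the job-local path is exposed through an environment variable (TMPDIR is used here as a stand-in; see the documentation linked above for the actual variable and path on gizmo nodes), and the durable destination is a placeholder.

```python
import os
import shutil
from pathlib import Path

# Assumption: the job-local scratch path is exposed via an environment variable.
# TMPDIR is a placeholder; consult the linked documentation for the real path.
job_local = Path(os.environ.get("TMPDIR", "/tmp"))

work_dir = job_local / "sort_tmp"
work_dir.mkdir(parents=True, exist_ok=True)

# ... run I/O-heavy steps against work_dir here ...

# Copy anything worth keeping to durable storage before the job ends,
# because job-local storage disappears when the job completes.
durable_dir = Path("/fh/fast/example_lab/run42")  # placeholder durable destination
durable_dir.mkdir(parents=True, exist_ok=True)
for result in work_dir.glob("*.bam"):
    shutil.copy2(result, durable_dir / result.name)
```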
Cloud Scratch
The bucket “fh-pi-lastname-f-nextflow-scratch” is configured as part of the standard cloud account. This bucket is described in greater detail here.
feature | availability |
---|---|
backups | no |
snapshots | no |
offsite replication/dr | limited |
charges | yes |
purged when unused | yes |
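As an illustration, the boto3 sketch below parks an intermediate file in the cloud scratch bucket for a later step and retrieves it again. The bucket name follows the pattern described above with a placeholder last name, and the object key and filenames are made up.

```python
import boto3

# Placeholder bucket name following the pattern above; substitute the PI's last name.
SCRATCH_BUCKET = "fh-pi-lastname-f-nextflow-scratch"
KEY = "run42/intermediate/aligned.bam"

s3 = boto3.client("s3")

# Park an intermediate file in cloud scratch for a downstream step...
s3.upload_file("aligned.bam", SCRATCH_BUCKET, KEY)

# ...and pull it back when that step runs. Objects here are purged when unused
# and charges apply, so keep only working data in this bucket.
s3.download_file(SCRATCH_BUCKET, KEY, "aligned.bam")
```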