Data Ingestion and Public Datasets Best Practices


Large-scale research data can come from multiple sources, such as one of the Fred Hutch Shared Resources, external vendors, external collaborators, or public repositories. Regardless of the source, if processing or analysis is required, your data will need to be accessible from Fred Hutch-managed data storage, a compute resource, or a workstation. Being selective about the degree to which public datasets need to be copied to local storage can lower your project costs.

Below, we outline several approaches for ingesting and storing only the most relevant portions of the data.

Data Ingestion for Externally Generated Data

For data from non-Fred Hutch entities that you would like to transfer to Fred Hutch-managed storage for further analysis, there is a multi-step process that Scientific Computing can assist you with. Large-scale biomedical datasets carry higher risks of data corruption and transfer interruption, and “intermediate” data may also need to be generated during analyses by Fred Hutch investigators. It is therefore important to work with Scientific Computing to ensure that the external data are transferred completely and that you have secure, affordable storage on Fred Hutch systems.

The process is generally:

  • Provide the sequencing center or data source with the information needed to copy the data into one of the Fred Hutch-managed Amazon S3 transfer buckets.

  • Validate the MD5 checksums of the transferred data against the checksum information (usually a text file of md5sums) provided by the sequencing center or data source; this check catches data corruption and incomplete transfers. A sketch of this step appears after this list.

  • Transfer the validated data to the PI’s Economy storage.
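
The checksum-validation step can be scripted. Below is a minimal sketch in Python, assuming the data source provided a standard md5sum-style manifest (one "<digest>  <filename>" entry per line); the directory and manifest paths are hypothetical and should be adjusted to wherever your transfer landed.

```python
import hashlib
from pathlib import Path

# Hypothetical locations: adjust to wherever the transfer landed.
DATA_DIR = Path("/fh/scratch/delete30/lastname_f/incoming")
MANIFEST = DATA_DIR / "md5sums.txt"

def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a file's MD5 digest, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

failures = []
for line in MANIFEST.read_text().splitlines():
    if not line.strip():
        continue
    # md5sum format: "<digest>  <filename>" (a "*" prefix marks binary mode).
    expected, _, name = line.strip().partition("  ")
    target = DATA_DIR / name.lstrip("*")
    if not target.is_file():
        failures.append(f"MISSING  {name}")    # incomplete transfer
    elif md5_of(target) != expected.lower():
        failures.append(f"MISMATCH {name}")    # corrupted transfer

print("\n".join(failures) if failures else "All checksums verified.")
```

Any missing or mismatched files should be re-transferred and re-checked before the data are moved to Economy storage.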


Data Locations for Fred Hutch Shared Resource-Generated Data

For data generated by Fred Hutch researchers via the Genomics Shared Resource, the default data deposition is currently managed directly by Genomics, with the data made available to researchers via their Fast storage (e.g., at path /fh/fast/lastname_f/SR/ngs for sequencing data). Other types of datasets are transferred to researchers in either a dnaarray directory or via other transfer mechanisms specific to the platform type or data source. This allows rapid access to recently generated datasets. However, once data generated via the Genomics Core become primarily of archival interest for occasional use, it is a good idea to visit the Data Storage section and consider implementing the active data management scheme described above with the assistance of Scientific Computing.

For example, depending on the intended use of the datasets, it may be desirable, once data are generated by the Genomics Shared Resource, to archive them to the researcher’s Economy storage space, with a copy placed in Scratch for immediate processing. The specific organization of archive and working copies will depend on the particular project involved.
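
As a rough illustration of that archive/working-copy split, the sketch below stages only the files needed for the current analysis into Scratch. The paths are hypothetical; if your Economy allocation is object storage rather than a mounted filesystem, a transfer tool (e.g., the AWS CLI or Motuz) would replace the file copy shown here.

```python
import shutil
from pathlib import Path

# Hypothetical paths: an archive copy in Economy-backed storage and a
# working copy in Scratch.
ARCHIVE = Path("/fh/economy/lastname_f/ngs/run_2024_01")
WORKING = Path("/fh/scratch/delete30/lastname_f/run_2024_01")

WORKING.mkdir(parents=True, exist_ok=True)

# Stage only the files the current analysis needs, not the whole run.
for fastq in sorted(ARCHIVE.glob("*.fastq.gz")):
    destination = WORKING / fastq.name
    if not destination.exists():      # skip files already staged
        shutil.copy2(fastq, destination)
        print(f"staged {fastq.name}")
```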

Available Resources

  • For consulting about how to handle large amounts of externally or internally generated data, email scicomp.
  • For additional assistance regarding data generated via the Fred Hutch Genomics Shared Resource, email bioinformatics.

Publicly Available Datasets

There are multiple sources and tiers of publicly available data. To avoid, for example, paying to host large, raw datasets that are already publicly available, there are approaches for accessing and documenting only the minimum required data. Knowing how best to approach large-scale public datasets can make a study far more productive in a shorter period of time and can keep resources from being spent unnecessarily on generating new data or storing copies of existing datasets that are not required for the research at hand. This section will have more to come on this topic.
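
As one concrete example of this approach, the sketch below queries study-level metadata from cBioPortal’s public REST API (one of the resources listed below) rather than copying any raw data locally. The endpoint and field names reflect the public API at www.cbioportal.org/api and should be verified against its current documentation.

```python
import json
from urllib.request import urlopen

# Public, unauthenticated endpoint listing all studies hosted by cBioPortal.
URL = "https://www.cbioportal.org/api/studies"

with urlopen(URL) as response:
    studies = json.load(response)

# Keep only the identifiers relevant to the project; documenting these IDs
# is far cheaper than storing a local copy of the underlying data.
tcga_ids = [s["studyId"] for s in studies if "tcga" in s["studyId"]]
print(f"{len(studies)} public studies, {len(tcga_ids)} TCGA-related")
```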

Available Resources

  • cBioPortal is an excellent web-accessible resource for querying publicly available study data from projects such as TCGA, as well as other more specific studies.
  • Sage Bionetworks’ Synapse platform hosts and organizes several open research projects involving large-scale molecular datasets; researchers can follow its documentation to download data through the web interface or the Python client (a minimal download sketch follows this list).
  • ImmuneSpace has publicly accessible RNA-seq, HAI, and flow-cytometry data from the Human Immunology Project Consortium.
  • CAVD DataSpace has publicly accessible data from HIV vaccine studies.
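
For the Synapse route mentioned above, a minimal sketch using the synapseclient Python package (installed via pip install synapseclient) might look like the following. The entity ID is a placeholder, and most datasets require a Synapse account and an accepted data-use agreement.

```python
import synapseclient

# Authenticate; login() uses cached credentials or a personal access token.
syn = synapseclient.Synapse()
syn.login()

# Fetch a single file entity into a local working directory rather than
# mirroring an entire project. "syn00000000" is a placeholder ID.
entity = syn.get("syn00000000", downloadLocation="./synapse_data")
print(f"Downloaded {entity.name} to {entity.path}")
```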

