Preprocessing and transforming input data, also known as “data wrangling,” is an essential step in the data analysis pipeline. Preparing data for analysis involves several challenges, including handling missing data, integrating diverse data types from multiple sources, and ensuring that both individual data fields and the larger organization of the data are in the correct format for downstream analysis.

On this page, we provide an overview of resources for learning how to wrangle data, software for data wrangling, and tools developed at Fred Hutch. While this is not an exhaustive list, we have highlighted many of the most commonly used and readily accessible resources for data scientists.

Tidy Data

Beyond wrangling and cleaning your data, the practice of “data tidying” ensures that your datasets have a consistent structure and are easy to manipulate, model, and visualize. Tidy datasets list individual observations as rows and variables as columns. We highly recommend including data tidying as a key step in your data wrangling process!
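As a rough illustration of the "observations as rows, variables as columns" idea, the sketch below reshapes a small wide table into tidy form with pandas. The table, its column names (`sample_id`, `day_0`, `day_7`), and its values are invented for this example and do not come from any Fred Hutch dataset.

```python
import pandas as pd

# Hypothetical "wide" table: one row per sample, one column per time point.
wide = pd.DataFrame({
    "sample_id": ["s1", "s2"],
    "day_0": [1.2, 0.8],
    "day_7": [2.4, 1.1],
})

# Reshape so that each row is one observation (a sample on a given day)
# and each column is one variable.
tidy = wide.melt(id_vars="sample_id", var_name="day", value_name="measurement")

# Convert the "day_0"/"day_7" labels into an integer variable.
tidy["day"] = tidy["day"].str.replace("day_", "", regex=False).astype(int)

# Order rows by sample and day for readability.
tidy = tidy.sort_values(["sample_id", "day"]).reset_index(drop=True)
print(tidy)
```

In the tidy version the time point is an ordinary column, so filtering, grouping, or plotting by day does not require knowing how the original columns were named.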

Code-based data wrangling

Data Wrangling in R

Although base R offers some basic functions for data wrangling, there are a variety of fast, intuitive packages available in the R ecosystem for cleaning, transforming, and reshaping data.

Packages for data wrangling

Packages for handling large data

Data Wrangling in Python

Python is also widely used for data wrangling, particularly for handling complex and large-scale biomedical datasets. Several libraries in Python simplify the process of cleaning, transforming, and analyzing data.
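To make the cleaning and integration steps more concrete, here is a minimal sketch using pandas, one commonly used Python library for this kind of work. The two tables, the column names (`patient_id`, `age`, `viral_load`), and the "n/a" placeholder are assumptions made up for illustration; real datasets will need their own handling.

```python
import pandas as pd

# Hypothetical inputs: clinical metadata and assay results from two sources.
patients = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3"],
    "age": ["34", "n/a", "58"],  # stored as text, with a placeholder for missing
})
assays = pd.DataFrame({
    "patient_id": ["P1", "P3", "P4"],
    "viral_load": [1200.0, None, 300.0],
})

# Coerce age to numeric; unparseable entries such as "n/a" become NaN.
patients["age"] = pd.to_numeric(patients["age"], errors="coerce")

# Left join keeps every patient; the "_merge" indicator column records
# whether each patient row found a matching assay row.
merged = patients.merge(assays, on="patient_id", how="left", indicator=True)

# Count missing values per column before deciding how to handle them.
missing_counts = merged[["age", "viral_load"]].isna().sum()
print(merged)
print(missing_counts)
```

Whether missing values should then be dropped, imputed, or flagged depends on the downstream analysis, so this sketch only surfaces them.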

Core Libraries for Data Wrangling

Specialized Libraries for Biomedical Data

Handling Large Datasets

Community Resources

The FH-Data Slack is always available as a space for researchers to ask questions and share resources about data wrangling.

Learning Resources

Books and online tutorials can provide in-depth coverage of data wrangling techniques, offering a solid foundation for both novice and advanced biomedical data scientists.

Books for R

Books for Python

Other resources