01 - data prep

The following pages will walk through an example of how to initialize a MitoPilot project with your own data.

To get started, you will need:

- a directory containing all of your sequence data
- a CSV mapping file

First, let’s create a directory to house our project. On the command line, run the following:

mkdir -p /pool/public/genomics/${USER}/MitoPilot_workshop/my_project

Now we need some sequence data. The workshop GitHub repo provides example data from four octocoral species, with two FASTQ files per sample (the forward and reverse reads).

Let’s copy the data to our new project directory.

cd /pool/public/genomics/${USER}/MitoPilot_workshop/my_project
wget https://github.com/SmithsonianWorkshops/MitoPilot_workshop_2025/raw/refs/heads/main/sample_data/raw_data.tar.gz
tar -zxvf raw_data.tar.gz
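
Before moving on, it can help to confirm that the archive holds the read files you expect. The sketch below lists the archive contents without extracting them and counts the FASTQ entries. The function name `count_fastq_in_archive` is hypothetical, and the file-name pattern (`.fastq`, `.fq`, optionally gzipped) is an assumption about how the reads are named.

```shell
# Sketch: count FASTQ files inside a gzipped tar archive without extracting it.
# Assumption: read files end in .fastq, .fq, .fastq.gz or .fq.gz.
count_fastq_in_archive() {
  # -t lists entries, -z handles gzip, -f names the archive
  tar -tzf "$1" | grep -c -E '\.(fastq|fq)(\.gz)?$'
}
```

Running `count_fastq_in_archive raw_data.tar.gz` prints a single number, which should be twice the number of samples, since each sample has a forward and a reverse file.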

Next we need to create a CSV mapping file with the following required columns:

- ID: column with a unique identifier for each sample
- Taxon: column containing taxonomic information for each sample, no formatting requirements
- R1: full name of the forward read file
- R2: full name of the reverse read file
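
If you would rather not type the file names by hand, a starter mapping file with the four required columns can be generated from the read files themselves. This is only a rough sketch: the function name `make_map_csv` is hypothetical, and it assumes reads are named `<ID>_R1.fastq.gz` and `<ID>_R2.fastq.gz`, which may not match your data. The Taxon column is filled with a placeholder that you would replace by hand.

```shell
# Sketch: print a starter mapping file (ID,Taxon,R1,R2) for a directory of reads.
# Assumption: files follow the pattern <ID>_R1.fastq.gz / <ID>_R2.fastq.gz.
make_map_csv() {
  data_dir="$1"
  echo "ID,Taxon,R1,R2"
  for r1 in "$data_dir"/*_R1.fastq.gz; do
    [ -e "$r1" ] || continue            # skip if the glob matched nothing
    r1_name=$(basename "$r1")
    id=${r1_name%_R1.fastq.gz}          # strip the suffix to get the sample ID
    r2_name="${id}_R2.fastq.gz"
    # Taxon is a placeholder to be edited by hand afterwards
    echo "${id},TAXON_PLACEHOLDER,${r1_name},${r2_name}"
  done
}
```

For example, `make_map_csv path/to/reads > my_map_file.csv` writes the result to a file, where `path/to/reads` stands for wherever your FASTQ files live. Extra metadata columns can then be added in a spreadsheet editor.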

Normally you would create this metadata sheet from scratch in Excel, but for the workshop you can download the provided file:

cd /pool/public/genomics/${USER}/MitoPilot_workshop/my_project
wget https://raw.githubusercontent.com/SmithsonianWorkshops/MitoPilot_workshop_2025/refs/heads/main/sample_data/map_file.csv

This mapping spreadsheet can contain extra columns with additional metadata. In the example file, there are a few extra columns: Order, Tissue ID, and SeqID#. These extra metadata fields can be useful for sorting and grouping samples later on.
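
A quick sanity check of the mapping file can catch typos before a project is initialized. The sketch below confirms that the four required columns are present (extra columns are allowed, in any order) and that every file named under R1 and R2 exists. The function name `check_map_csv` is hypothetical, and it assumes a simple CSV with no quoted commas and that the read files sit directly inside the directory you pass in.

```shell
# Sketch: check a mapping file for required columns and existing read files.
# Assumptions: no quoted commas in the CSV; reads live directly in data_dir.
check_map_csv() {
  map_file="$1"
  data_dir="$2"
  rc=0
  header=$(head -n 1 "$map_file" | tr -d '\r')
  for col in ID Taxon R1 R2; do
    # Wrap in commas so only whole column names match
    case ",$header," in
      *",$col,"*) ;;
      *) echo "missing column: $col"; rc=1 ;;
    esac
  done
  [ "$rc" -eq 0 ] || return 1
  # Find the R1/R2 positions from the header, then print those file names
  files=$(tr -d '\r' < "$map_file" | awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) pos[$i] = i; next }
    NF > 0  { print $(pos["R1"]); print $(pos["R2"]) }')
  while IFS= read -r f; do
    [ -n "$f" ] || continue
    [ -e "$data_dir/$f" ] || { echo "missing file: $f"; rc=1; }
  done <<EOF
$files
EOF
  return "$rc"
}
```

Calling `check_map_csv map_file.csv path/to/reads` prints one line per problem and returns a non-zero exit status if anything is missing; no output means both checks passed.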