01 - data prep

The following pages will walk through an example of how to initialize a MitoPilot project with your own data.

To get started, you will need:

- a directory containing all of your sequence data
- a CSV mapping file

First, let’s create a directory to house our project. On the command line, run the following:

mkdir -p /pool/public/genomics/${USER}/MitoPilot_workshop/my_project

Now we need some sequence data. An example data directory is located on Hydra at /PATH/TO/DATA. This directory contains two FASTQ files per sample (the forward and reverse reads).

Let’s copy the data to our new project directory.

cp -rf /PATH/TO/DATA /pool/public/genomics/${USER}/MitoPilot_workshop/my_project

Next we need to create a CSV mapping file with the following required columns:

- ID: column with a unique identifier for each sample
- Taxon: column containing taxonomic information for each sample, no formatting requirements
- R1: full name of the forward read file
- R2: full name of the reverse read file
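If your read files follow a regular naming pattern, a starting point for the mapping file can be generated from the file names. The sketch below is not part of MitoPilot: `make_map_skeleton` is a hypothetical helper, and it assumes files are named `<ID>_R1.fastq.gz` and `<ID>_R2.fastq.gz` and that R1/R2 should hold the file name without the directory. The Taxon column cannot be inferred from file names, so it is filled with a placeholder that you would replace by hand.

```shell
# Hypothetical helper (not part of MitoPilot): print a mapping-file skeleton
# for a directory of paired reads.
# Assumption: files are named <ID>_R1.fastq.gz and <ID>_R2.fastq.gz;
# adjust the suffixes to match your own data.
make_map_skeleton() {
  data_dir="$1"
  echo "ID,Taxon,R1,R2"
  for r1 in "$data_dir"/*_R1.fastq.gz; do
    [ -e "$r1" ] || continue            # skip if the glob matched nothing
    r1_name=$(basename "$r1")
    id=${r1_name%_R1.fastq.gz}          # strip the suffix to get the sample ID
    r2_name="${id}_R2.fastq.gz"
    # TAXON_HERE is a placeholder; replace it with real taxonomic information.
    echo "${id},TAXON_HERE,${r1_name},${r2_name}"
  done
}

# Example usage (paths are placeholders):
# make_map_skeleton path/to/data > map_file.csv
```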

Normally you would create this mapping file from scratch in Excel and save it as a CSV, but for the workshop you can copy the provided file:

cp -rf /PATH/TO/map_file.csv /pool/public/genomics/${USER}/MitoPilot_workshop/my_project

The mapping file can also contain extra columns with additional metadata. The example file, for instance, has an extra Family column. These extra fields can be useful for sorting and grouping samples later on.
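Before moving on, it can help to sanity-check the mapping file against the data directory. The sketch below is a hypothetical helper, `check_map_file`, not a MitoPilot command. It assumes a simple CSV (no quoted fields containing commas, no byte-order mark) and that R1 and R2 hold file names relative to the data directory. Columns are located by header name, so extra columns such as Family do not interfere.

```shell
# Hypothetical helper (not part of MitoPilot): check that a mapping file has
# the required columns, unique IDs, and read files that exist on disk.
check_map_file() {
  map_file="$1"
  data_dir="$2"
  status=0
  clean=$(tr -d '\r' < "$map_file")     # tolerate Windows line endings
  header=$(printf '%s\n' "$clean" | head -n 1)

  # Every required column must appear in the header.
  for col in ID Taxon R1 R2; do
    if ! printf '%s\n' "$header" | tr ',' '\n' | grep -qx "$col"; then
      echo "missing required column: $col" >&2
      status=1
    fi
  done
  [ "$status" -eq 0 ] || return 1

  # Report any ID that appears more than once.
  dup_ids=$(printf '%s\n' "$clean" | awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "ID") c = i; next }
    NF > 0  { print $c }' | sort | uniq -d)
  if [ -n "$dup_ids" ]; then
    echo "duplicate IDs: $dup_ids" >&2
    status=1
  fi

  # Report any listed read file that is absent from the data directory.
  files=$(printf '%s\n' "$clean" | awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) { if ($i == "R1") a = i; if ($i == "R2") b = i }; next }
    NF > 0  { print $a; print $b }')
  while IFS= read -r f; do
    [ -n "$f" ] || continue
    if [ ! -e "$data_dir/$f" ]; then
      echo "file not found: $f" >&2
      status=1
    fi
  done <<EOF
$files
EOF
  return "$status"
}

# Example usage (paths are placeholders):
# check_map_file path/to/map_file.csv path/to/data && echo "mapping file looks OK"
```

The function prints each problem to standard error and returns a non-zero status if any check fails, so it can be chained with `&&` in a setup script.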