run dada2 amplicon sequencing analysis

Author:Nick Ilott
Release:$Id$
Date:Jan 23, 2020
Tags:Python

Purpose

The purpose of this pipeline is to take amplicon sequencing data and run it through the DADA2 pipeline - this includes filtering and trimming reads, merging and clustering reads and assigning reads to taxa.

Dependencies

In order to run this pipeline you will need to install the following packages:

  • CGAT tools. The documentation can be found here

Installation

You should install CGAT tools using the installation script that is provided, this will install everything you need into a conda environment.

To activate the enviroment do the following:

source </full/path/to/folder/without/trailing/slash>/conda-install/etc/profile.d/conda.sh
conda activate base
conda activate cgat-flow

# finally, please run the cgatflow command-line tool to check the installation:
cgatflow --help

Warning

Make sure that you are installing CGAT into a clean environment i.e. with no modules loaded from shared directories. You can do something like module purge to remove loaded modules.

However it does not come with dada2 installed. Dada2 requires RcppParallel to be installed so first, so once you have activated the CGAT environment install this:

conda install -c conda-forge r-rcppparallel

Then you should install the dada2 R package as follows:

conda install -c bioconda bioconductor-dada2

This should install dada2 so that when you start R you can simply type:

library("dada2")

You will in all likelihood also have to install optparse for argument parsing in the R scripts:

conda install -c conda-forge r-optparse

Once you have successfully installed CGAT, dada2 and optparse, you should then download the NGSKit repository - this contains the dada2 pipeline (/pipelines/pipeline_dada2.py). In a development folder run:

git clone git@github.com:nickilott/NGSKit.git

It is possible that the modules that are required for the pipeline to run will have to be placed into your PYTHONPATH:

PYTHONPATH=$PYTHONPATH:<path-to-NGSKit>
export $PYTHONPATH

Input files

The input is a directory of fastq formatted files. These should be placed in the directory in which you wish to run the pipeline. They must be of the format <name>.fastq.1.gz for single-end data and two files <name>.fastq.1.gz and <name>.fastq.2.gz for paired-end data.

Parameterisation

There is a pipeline.yml file that specifies the parameters to be passed to dada2 and reporting functions in the pipeline. This is located in <path-to-NGSKit>/pipelines/pipeline_dada2/pipeline.yml and needs to be copied to your working directory. The parameters should be changed to suitable values. and are explained in the dada2 tutorial.

Output files

The main output file of the pipeline is the counts matrix that consists of amplicon sequence variants and their abundance in each sample. The pipeline assigns taxonomy to each ASV and this is incorporated into the ASV name in the resulting file. It is of the form:

test_id Sample1 Sample2
ASV1:p__phylum1;c__class1;o__order1;f__family1;g__genus1;s__species1 1000 1239
ASV2:p__phylum2;c__class2;o__order2;f__family2;g__genus2;s__species2 500 10
ASV3:p__phylum3;c__class3;o__order3;f__family3;g__genus3;s__species3 1000 2300

The purpose of this output file is that it can be taken forward in a easy fashion to look at differential abundance using software such as DESeq2 and this will be done on a per ASV level. If you wish to perform analysis on counts that have been summed over taxa at a particular taxonomic level you can use the following output files:

  • taxonomy_abundances.dir/phylum_abundances.tsv
  • taxonomy_abundances.dir/class_abundances.tsv
  • taxonomy_abundances.dir/order_abundances.tsv
  • taxonomy_abundances.dir/family_abundances.tsv
  • taxonomy_abundances.dir/genus_abundances.tsv
  • taxonomy_abundances.dir/species_abundances.tsv

Running the pipeline

Running the pipeline should be as simple as:

python <path-to-NGSKit>/pipelines/pipeline_dada2.py -v5 -p50 make full

Set p to the number of processes that you want to run - this corresponds to the number of the samples that you want to run (unless it is a huge number and exceeds a reasonable number to run on the cluster).

Reporting

Once the pipeline has finished you can build a report by running:

python <path-to-NGSKit>/pipelines/pipeline_dada2.py -v5 -p50 make build_report