6. Walkthrough of pre-constructed pipeline

6.1. List of pipelines

  • WGS data analysis Enables the user to identify variants from raw sequencing reads and functionally annotate them. Multiple tools are integrated in the WGS analysis pipeline.
  • RNA-Seq data analysis RNA-Seq part enables quantification of gene expression. The pipeline provides a final output of a list of differentially expressed genes. (to be constructed as provided in the CANEapp manual)

https://github.com/anjanaanilkumar1289/iCOMIC_doc/blob/master/docs/screenshots/pipeline.png?raw=true Figure 1: The figure below depicts the available pipelines supported by iCOMIC

6.2. Description of the tools used

Table shows the tools incorporated in iCOMIC

Function DNA-Seq Tools RNA-Seq Tools
Quality Control FastQC, MultiQC, Cutadapt FastQC, MultiQC, Cutadapt
Alignment GEM-Mapper, BWA-MEM, Bowtie2 STAR, HISAT2
Variant Calling GATK HC, samtools mpileup, FreeBayes, GATK Mutect2 -
Annotation Annovar, SnpEff -
Expression Modeller - StringTie, HTSeq
Differential Expression - DESeq2, ballgown

6.3. Tools for Quality Control

FastQC

It is a popular tool that can be used to provide an overview of the basic quality control metrics for raw next generation sequencing data. There are a number different analyses (called modules) that may be performed on a sequence data set. It provides summary graphs enabling the user to decide on the directions for further analysis.

MultiQC

MultiQC is a modular tool to aggregate results from bioinformatics analyses across multiple samples into a single report. It collects numerical stats from different modules enabling the user to track the behavior of the data in an efficient manner.

Cutadapt

Cutadapt is a trimming tool that enables the user to remove adapter and primer sequences in an error-tolerant manner. It can also aid in demultiplexing, filtering and modification of single-end and paired-end reads. Essential parameters for the tool are listed below. The detailed list of parameters of the tool are available in Cutadapt documentation.

Parameter Description
-a 3’ Adapter sequence
-g 5’ adapter sequence
-Z Compression level
-u (n) Removes n reads unconditionally
-q Quality cutoff

Aligners

GEM-Mapper

It is a high-performance mapping tool that performs alignment of sequencing reads against large reference genomes. GEM Mapper has been identified as an efficient mapping tool by a benchmarking analysis performed along with this study. Listed below are some parameters of the tool GEM-Mapper. Other parameters can be found in GEM-Mapper github page

Parameter Description
-t Threads
-e --alignment-max-error
--alignment-global-min-identity Minimum global-alignment identity required
--alignment-global-min-score Minimum global-alignment score required

BWA-MEM

One of the most commonly used aligners available. It is identified as a faster and accurate algorithm among the algorithms in BWA software package. It is known for aligning long sequence query reads to the reference genome and also performs chimeric alignment. The parameters for BWA-MEM include the following. The other parameters for the tool can be found in BWA manual page

Parameter Description
-t Threads
-k minSeedLength
-w Band width
-d Z-dropoff
-r seedSplitRatio
-A matchScore

Bowtie2

Bowtie2 is a fast and efficient algorithm for aligning reads to a reference sequence. It comprises various modes wherein it supports local, paired-end and gapped alignment. The key parameters for Bowtie2 include the following. All parameters for Bowtie2 are listed in Bowtie2 manual.

Parameter Description
--threads Threads
--cutoff (n) Index only the first (n) bases of the reference sequences (cumulative across sequences) and ignore the rest.
-seed The seed for pseudo-random number generator
-N Sets the number of mismatches to allowed in a seed alignment during multiseed alignment
dvc the period for the difference-cover sample

STAR

STAR is a rapid RNA-Seq read aligner specializing in fusion read and splice junction detection. Important parameters for STAR is given below. The other parameters for the tool can be found in STAR manual page

parameters Description
-- runThreadN NumberOfThreads
--runMode genomeGenerate
--genomeDir /path/to/genomeDir
--genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2
--sjdbGTFfile /path/to/annotations.gtf
--sjdbOverhang ReadLength-1

HISAT2

It is a fast and sensitive alignment program applicable for both RNA-seq and Whole-Genome Sequencing data and is known for rapid and accurate alignment of sequence reads to a single reference genome. The key parameters for the tool are given below. The other parameters for the tool can be found in HISAT2 manual page

parameters Description
- x (hisat-idx) The basename of the index for the reference genome
-q Reads which are FASTQ files
--n-ceil (func) Sets a function governing the maximum number of ambiguous characters (usually Ns and/or .s) allowed in a read as a function of read length
--ma (int) Sets the match bonus
--pen-cansplice (int) Sets the penalty for each pair of canonical splice sites (e.g. GT/AG)

Variant Callers

GATK HC

Variant caller for single sample analysis

One of the extensively used variant callers. Calls variants from the aligned reds corresponding to the reference genome. Some of the parameters for GATK Haplotype caller are listed beow. The complete parameter list is available at GATK Haplotypecaller article page

Parameters Description
-contamination Contamination fraction to filter
-hets heterozygosity
-mbq Min base quality score
-minReadsPerAlignStart Min Reads Per Alignment Start

Samtools mpileup

Variant caller for single sample analysis

Samtools mpileup together with BCFtools call identifies the variants. Some key parameters to look are listed below. Parameters in detail are found in Samtools-mpileup manual page

Parameter Description
-d --max-depth
-q Minimum mapping quality for an alignment to be used
-Q Minimum base quality for a base to be considered

freebayes

Variant caller for single sample analysis

FreeBayes is a variant detector developed to identify SNPs, Indels, MNPs and complex variants with respect to the reference genome. Key parameters for FreeBayes are listed below. Other parameters can be found in detain in FreeBayes parameter page

Parameter Description
-4 Include duplicate-marked alignments in the analysis.
-m minimum mapping quality
-q minimum base quality
-! minimum coverage
-U read mismatch limit

GATK Mutect2

Variant caller for normal-tumor sample analysis

This tool identifies somatic mutations such as indels and SNAs in a diseased sample compared to the provided normal sample, using the haplotype assembly strategy. Parameters specific to Mutect2 include the following. The complete parameter list is available at GATK Mutect2 manual page

Parameter Description
--base-quality-score-threshold Base qualities below this threshold will be reduced to the minimum
--callable-depth Minimum depth to be considered callable for Mutect stats. Does not affect genotyping.
--max-reads-per-alignment-start Maximum number of reads to retain per alignment start position.
-mbq Minimum base quality required to consider a base for calling

Annotators

SnpEff

SnpEff tool performs genomic variant annotations and functional effect prediction. Key parameters for the tool SnpEff are listed below. Detailed list of parameters is given in SnpEff manual page

Parameter Description
-t Use multiple threads
-cancer perform 'cancer' comparisons (Somatic vs Germline)
-q Quiet mode
-v Verbose mode
-csvStats Create CSV summary file instead of HTML

Annovar

Annovar can be used to efficiently annotate functional variants such as SNVs and indels, detected from diverse genomes. The tool also provides the user with multiple annotation strategies namely Gene-based, region-based and filter-based. Key parameters for Annovar include the following. Details of the tool can be found in Annovar documentation page

Parameter Description
--splicing_threshold distance between splicing variants and exon/intron boundary
--maf_threshold filter 1000G variants with MAF above this threshold
--maxgenethread max number of threads for gene-based annotation
--batchsize batch size for processing variants per batch (default: 5m)

Expression modellers

StringTie

StringTie is known for efficient and rapid assembly of RNA-Seq alignments into possible transcripts. It employs a novel network flow algorithm and an optional de novo assembly algorithm to assemble the alignments. The important parameters to look into are listed below. The other parameters for the tool can be found in StringTie Manual Page

Parameters Description
--rf Assumes a stranded library fr-firststrand
--fr Assumes a stranded library fr-secondstrand
--ptf (f_tab) Loads a list of point-features from a text feature file (f_tab) to guide the transcriptome assembly
-l (label) Sets (label) as the prefix for the name of the output transcripts
-m (int) Sets the minimum length allowed for the predicted transcripts

HTSeq

HTSeq facilitates in counting the number of mapped reads to each gene. It provides the user with multiple modes of usage and also allows the creation of custom scripts. Key parameters for the tool are given below. The other parameters for the tool can be found in HTSeq Manual Page

Parameters Description
-f Format of the input data
-r For paired-end data, the alignment have to be sorted either by read name or by alignment position
-s whether the data is from a strand-specific assay
-a skip all reads with alignment quality lower than the given minimum value
-m Mode to handle reads overlapping more than one feature

Differential Expression tools

ballgown

ballgown is an R language based tool that enables the statistical analysis of assembled transcripts and differential expression analysis along with its visualization. Key arguments for the tool are given below. The other arguments for the tool can be found in ballgown Manual page

Arguments Description
samples vector of file paths to folders containing sample-specific ballgown data
dataDir file path to top-level directory containing sample-specific folders with ballgown data in them
samplePattern regular expression identifying the subdirectories of\ dataDir containing data to be loaded into the ballgown object
bamfiles optional vector of file paths to read alignment files for each sample
pData optional data.frame with rows corresponding to samples and columns corresponding to phenotypic variables
meas character vector containing either "all" or one or more of: "rcount", "ucount", "mrcount", "cov", "cov_sd", "mcov", "mcov_sd", or "FPKM"

DESeq2

It Uses negative binomial distribution for testing differential expression using R language. Some of the arguments to look into are given below. The other arguments for the tool can be found in DESeq2 Manual Page

Arguments Description
object A Ranged Summarized Experiment or DESeqDataSet
groupby a grouping factor, as long as the columns of object
run optional, the names of each unique column in object
renameCols whether to rename the columns of the returned object using the levels of the grouping factor
value an integer matrix