9. Tutorials¶
9.1. What is it?¶
iCOMIC
(Integrating Context Of Mutation In Cancer) is an open-source, standalone tool for genomic data analysis characterized by a Python-based Graphical User Interface, automated bioinformatics pipelines for analyzing Whole genome/exome
and transcriptomic
data along with Machine Learning tools, cTaG
and NBDriver
for cancer related data analysis. It serves as a point and click application facilitating genomic data analysis accessible to researchers with minimal programming expertise. iCOMIC takes in raw sequencing data in FASTQ format as an input, and outputs insightful statistics on the nature of the data. iCOMIC toolkit is embedded in Snakemake
, a workflow management system and is characterized by a user-friendly GUI built using PyQt5
which improves its ease of access. The toolkit features many independent core workflows in both whole genomic and transcriptomic data analysis pipelines.
9.2. Prerequisites:¶
- Linux/Windows/Mac (MacOS v10.15.5 or above) platform
- Python 3.6 and above
- Miniconda
- iCOMIC package downloaded from GitHub
9.3. Conda Installation¶
The entire source code for the tool is available at this link
Create an environment and install the dependencies associated with iCOMIC by using the following command:
$ cd iCOMIC-main
$ conda env create -f icomic_env.yml #for Linux users
$ conda env create -f icomic_env_mac.yml #for MacOS users
$ conda activate icomic_env
$ pip install -e icomic #for the first time only
$ cd icomic #path/to/icomic directory
$ icomic
$ conda deactivate #after completing the analysis
9.4. Testing¶
The user can test the iCOMIC pipeline using the demo data, reference genome, known variants file, annotation file to run DNA-Seq, RNA-Seq, cTaG and NBDriver provided in this link.
The path to the respective samples are:
- DNA samples - dna_germline_samples or dna_somatic_samples
- RNA samples - rna_samples
- cTaG - variants.maf
- NBDriver - NBDriver_vcf.vcf
After downloading and creating the environment to run iCOMIC, open iCOMIC GUI by simply typing ‘icomic’ in the terminal and then follow the steps provided below:
DNA-Seq¶
Germline variant calling
- Click upload from Table
- Sample table path - dna_germline_samples/unit_samples.tsv
NOTE:Edit the paths inside the tsv file before uploading.
- Reference Genome path - genome.chr21.fa
- Reference Known Variant path - dbsnp.vcf.gz
- Enter Threads and proceed with the steps given in the section 9.5
Somatic variant calling
- Click upload from Folder
- Sample Folder path - dna_somatic_samples
- Reference Genome path - hg38.fa
- Reference Known Variant path - clinvar_20191231.vcf.gz
- Enter Threads and proceed with the steps given in the section 9.5
RNA-Seq¶
- click upload from Folder
- Sample Folder path - rna_samples
- Fasta File path - hg38.chr22.fa
- Annotated File path - chr22_refGene.gtf
- Enter Threads and proceed with the steps given in the section 9.5
cTaG¶
- Path to MAF file - variants.maf
- Enter Parameters and click Run
Refer section 9.5 for more details
NBDriver¶
In order to run NBDriver, the user needs to download the hg19 reference genome from this link and put it in the "/iCOMIC/NBDriver_iCOMIC/" directory
- Path to VCF file - NBDriver_vcf.vcf
- Click Run
Refer section 9.5 for more details
9.5. Analysis quick guide¶
9.5.1 Launching the wrapper¶
iCOMIC can be launched using a simple command in the terminal.
$ icomic
9.5.2 Running iCOMIC: A quick walkthrough¶
Here is a typical set of actions to run iCOMIC pipelines:
- Select a pipeline
- Choose the mode of input
- Input the required data fields
- Proceed to the next tab if you want to skip Quality Check
- Or click on the
Quality Control Results
button to view a consolidated MultiQC report of Quality statistics - Check
yes
if you want to do trimming and also mention the additional parameters as per requirement - Tool for Quality Control: FastQC
- Tool for trimming the reads: Cutadapt
- Choose the tools of interest from
Tool selection
tab and set the parameters as required - For the choice of aligner, the corresponding genome index file needs to be uploaded if available, or the user can generate the index file using the Generate Index button
- Click
Run
on the next tab to run the analysis - Once the analysis is completed, the
Results
tab will be opened - DNA-Seq results include a MultiQC report comprising the statistics of the entire analysis, a file consisting of the variants called and the corresponding annotated variant file
- Results for RNA-Seq analysis include multiQC analysis statistics, R plots such as MA plot, Heatmap, PCA plot and box plot and list of differentially expressed genes
- Proceed to cTaG/NBDriver tab for further analysis if needed
9.5.3 Adding samples: Method one¶
iCOMIC accepts input information in two
different modes. In the first
method, the user can feed the path to a folder containing raw fastq files. For the direct upload of a sample folder, the folder should contain only the samples and the sample file names should be in a specific format:
{sample_name}_{condition}_Rep{replicate_number}_R{1 / 2}.fastq
{sample_name}
should be replaced with the sample name{condition}
should be replaced with the nature of the sample, normal or tumor. If you are using a germline variant calling pipeline, the condition should be normal for all the samples{replicate_number}
should be replaced by the number of replicate- If the sample is paired end,
{1 / 2}
should be replaced by 1 or 2 accordingly for forward and reverse sequences. If the sample is single end, {1 / 2} can be replaced by 1
9.5.4 Adding samples: Method two¶
The user can provide a table consolidating particulars of raw data. The sample information should be given in a tab delimited file with a header row. The Column names should be:
- Sample : The Sample name
- Unit : The number of replicates
- Condition : Nature of the sample, normal or tumor
- fq1 : The path of Read 1
- fq2 : Path of Read 2, if you are working with single-end reads only, the ‘fq2’ column can be left blank
9.5.5 Adding samples: specifying DNA-seq workflow¶
The significant obligation is raw fastq files which can either be single-end or paired-end. Fastq read details can be specified in two different methods, either by uploading a folder containing the reads or using a tab-separated file describing the reads as specified in the previous sections. Other input requirements and the file specifications are as mentioned:
- Samples Folder : Path to the folder containing samples satisfying the conditions mentioned in section 3
- Samples Table : Path to the tsv file generated according to instructions in section 4 as an alternative to Samples folder
- Reference Genome : Path to the reference genome. The file should have an extension .fa
- Reference Known Variant : Path to the reference known variants file. The file should be a bgzipped vcf
- Maximum threads : The maximum number of threads that can be used for running each tool. Once all the fields are filled, you can proceed to the Quality Control tab using the next button.
Figure 1: Input tab of DNA Seq pipeline
9.5.6 Adding samples: specifying RNA-seq workflow¶
Similar to the DNA-Seq pipeline, the major requirement here is also raw fastq files, either single-end or paired-end. Fastq read details can be specified in two different methods, either by uploading a folder containing the reads or using a tab-separated file describing the reads, as specified in the previous sections. Other input requirements and file specifications for RNA-seq workflow are mentioned below.
- Samples Folder : Path to the folder containing samples satisfying the conditions mentioned in section 3
- Samples Table : Path to the tsv file generated according to instructions in section 4 as an alternative to Samples folder
- Fasta file : Path to the reference genome. The file should have an extension .fa
- Annotated file : Path to the gtf annotation file
- Maximum threads : The maximum number of threads that can be used for running each tool. Once all the fields are filled, users can proceed to the Quality Control tab using the next button.
Figure 2: Input tab of RNA Seq pipeline
9.5.7 Review of Sample quality¶
In the Quality Control widget, you can examine the quality of your samples for analysis by clicking on the Quality Control Results button. The tool MultiQC provides a consolidated report of Quality statistics generated by FastQC for all the samples. Additionally, iCOMIC permits you to trim the reads using Cutadapt if required.
However, it is also possible to move ahead without going through the Quality Check process. The Quality Control widget is more or less identical for DNA and RNA seq workflows.
Figure 3: Quality Control tab of DNA Seq pipeline
Figure 4: Quality Control tab of RNA Seq pipeline
9.5.8 Specifying analysis settings DNA seq¶
DNA-Seq constitutes the Whole Genome/Exome Sequencing data analysis pipeline which permits the user to call variants from the input samples and annotate them. iCOMIC integrates a combination of 3 aligners, 5 variant callers and 2 annotators along with the tools for Quality control. The tool MultiQC is incorporated to render comprehensive analysis statistics.
In the tool selection widget, you will be asked to choose your desired set of tools for analysis.
- Aligner
You can choose a software for sequence alignment from the drop down menu. You will also need to input the genome index corresponding to the choice of aligner. iCOMIC allows you to generate the required index using the Generate index button. One will have the permission to change the values for the mandatory parameters displayed. Moreover, if you are an expert bioinformatician, iCOMIC allows you to play around with the advanced parameters. Clicking on the Advanced button would open a pop-up of all the parameters associated with a tool.
Figure 5: Tools tab of DNA Seq pipeline
- Variant Caller
This section permits you to choose a variant caller from the set of tools integrated. If the input sample is normal-tumor specific, then only those tools which call variants comparing the normal and tumor samples will be displayed. On the other hand, if you want to call variants corresponding to the reference genome, variant callers of that type would be displayed. iCOMIC allows you to set mandatory as well as advanced parameters for the selected tool.
- Annotator
This section allows you to choose a tool for annotating your called variants and specify the parameters.
Figure 6: Tools tab of DNA Seq pipeline
9.5.9 Setting up differential gene expression analysis¶
RNA-Seq part allows you to identify the differentially expressed genes from RNA Sequencing data. iCOMIC integrates a combination of 2 aligners, 2 expression modellers and 2 differential expression tools along with the tools for Quality control. The tool MultiQC is incorporated to render comprehensive analysis statistics.
Available pipelines:
- HISAT2-StringTie-ballgown
- STAR-StringTie-ballgown
- HISAT2-HTSeq-DESeq2
- STAR-HTSeq-DESeq2
- Aligner
You can choose a software for sequence alignment from the drop down menu. You will also need to input the genome index corresponding to the choice of aligner. No worries! iCOMIC allows you to generate the required index using the Generate index button. One will have the permission to change the values for the mandatory parameters displayed. Moreover, if you are an expert bioinformatician, iCOMIC allows you to play around with the advanced parameters. Clicking on the Advanced button would open a pop-up of all the parameters associated with a tool.
Figure 7: Tools tab of RNA Seq pipeline
- Expression Modeller
This section allows you to choose an expression modeller from the integrated list of tools for counting the reads with the help of annotation file. Users will also have the freedom to set parameters corresponding to the tool.
- Differential Expression tool
Here you can choose a tool for quantifying differential expression and can also set parameters.
Figure 8: Tools tab of RNA Seq pipeline
9.5.10 Submitting the analysis¶
The Run tab consists of a Run button to initialize and proceed with the analysis. Progress bar present in the tab allows you to examine the extent to which the process has been completed.
Figure 9: Run tab of DNA Seq pipeline
Figure 10: Run tab of RNA Seq pipeline
9.5.11 Retrieving the data¶
Once the analysis is completed, iCOMIC will automatically move on to the Results tab which displays three major results.
- DNA-Seq
The results displayed for DNA seq workflow are listed below.
- Analysis Statistics
Displays a MultiQC consolidated report of overall analysis statistics. This includes FastQC reports, Alignment statistics and variant statistics.
- Variants called
On clicking this button a pop up with the vcf file of variants called will be displayed.
- Annotated variants
Displays the annotated vcf file
Figure 11: Results tab of DNA Seq pipeline
- RNA-seq
The results displayed for DNA seq workflow are listed below.
- Analysis Statistics
Displays a MultiQC consolidated report of overall analysis statistics. This includes FastQC reports and Alignment statistics
- Differentially Expressed Genes
On clicking this button a pop up with the list of differentially expressed genes will be displayed.
- Plots
Displays differentially expressed genes in R plots such as MA plot, Heatmap, PCA plot and box plot.
Figure 12: Results tab of RNA Seq pipeline
9.5.12 Analysis with BAM input¶
iCOMIC allows the user to start the analysis with aligned BAM files. For running iCOMIC with BAM files as input, the files should be sorted and stored in a folder named ‘results_dna/mapped’ or ‘results/mapped’ in the case DNA seq and RNA seq workflows respectively. The BAM files should be named in the format {sample}-{unit}-{condition}.sorted.bam
. It is advised that while choosing this approach, the input is provided as a table. The sample information should be specified as mentioned in section 3 with fq1 and fq2 columns empty.
9.5.13 Running cTaG¶
cTaG (classify TSG and OG) is a tool used to identify tumour suppressor genes (TSGs) and oncogenes (OGs) using somatic mutation data. A maf file is required to run the cTaG tool, it can either be generated from the DNA-Seq output vcf file in the results tab or browsed locally. Added to that, you can mention the parameters required to run the cTag in the parameters option provided.You can click on the run button to initialize the analysis, once the necessary files have been uploaded. Once the analysis is completed, you can click on the Results button to view the results.
Figure 13: cTaG tab
9.5.14 Running NBDriver¶
NBDriver (NEIGHBORHOOD Driver) is a tool used to differentiate between driver and passenger mutations. A vcf file is required to run NBDriver, it can either be browsed from the DNA-Seq output directory or locally. In order to run NBDriver, the user needs to download the hg19 reference genome from this link and put it in the "/icomic/NBDriver_iCOMIC/"
directory. You can click on the run button to initialize the analysis, once the necessary files have been uploaded. Once the analysis is completed, you can click on the Results button to view the results.
Figure 14: NBDriver tab
9.6 Retrieving logs¶
The Logs tab at the bottom of each section in the GUI displays the commands executed by the user. Seperate log for each tools are available inside logs folder created during the analysis. The user can check the log files at any time.