PM4NGS RNASeq Tutorial

RNA-seq of the lobe-less (lol) mutants in Drosophila melanogaster

Lobe-less (lol) encodes a long noncoding RNA (lncRNA) required for the development of Drosophila mushroom body, a center of insect memories and learning. RNA-seq was carried out in the control (w[1118]) and lol mutant flies to identify genes regulated by lol lncRNA.

From the manuscript: Genetic interactions between Protein Kinase D and Lobe mutants during eye development of Drosophila melanogaster

_images/lol.png

BioProject

A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium. A BioProject record provides users a single place to find links to the diverse data types generated for that project.

BioProject: PRJDB5673

BioProject: PRJDB5673

SRA

Sequence Read Archive (SRA) data, available through multiple cloud providers and NCBI servers, is the largest publicly available repository of high throughput sequencing data.

List of SRA samples in the BioProject

BioProject: PRJDB5673

Click on the Send results to Run selector to see the SRA data in the Run selector app.

BioProject: PRJDB5673

Select samples to be analyzed: DRR092341, DRR092342, DRR092343, DRR092344. Wild type, adult male head compared with Lol mutant, adult male head.

BioProject: PRJDB5673

Click on Deliver Data to request NCBI to transfer the data to a cloud bucket.

BioProject: PRJDB5673

GCP Instance Setup

Create a GCP instance

Create a GCP instance with 16 CPUs and 60GB of RAM (machine type: n1-standard-16). Click on Change button in the Boot disk box to select Operating system: Ubuntu, Version: Ubuntu 20.04 LTS. Click on Management to select Preemptibility: ON

GCP instance creation

Accessing the instance with SSH

The instance will be available with two IP address, one internal for GCP and another external accessible from Internet. Click on the SSH button to login.

Instances

After clicking the SSH button, a new windows will be open. This is a Linux terminal running directly in the instance.

Linux terminal

Installing dependencies on the GCP instance with Ubuntu

Runs these commands on a terminal to prepare the instance to run PM4NGS

veraalva@instance-veraalva:~$ sudo apt-get update
veraalva@instance-veraalva:~$ sudo apt-get install docker.io python3 python3-pip python3-venv python3-dev poppler-utils gcc nodejs tree
veraalva@instance-veraalva:~$ sudo usermod -aG docker $USER
veraalva@instance-veraalva:~$ logout

Close and reopen the terminal to set the docker group in the user. Then, click on the SSH button again to re-launch the terminal.

Testing the Docker daemon

veraalva@instance-veraalva:~$ docker run hello-world

Docker will pull the hello-world image and run it in a container. A Hello from Docker! message is displayed in the terminal.

Testing Docker

Installing PM4NGS

Read PM4NGS published manuscript in GIGAScience

What is a Python virtual environment?

A python virtual environment is a "new Python installation" with their own site directories, optionally isolated from system site directories. Each virtual environment has its own Python binary (which matches the version of the binary that was used to create this environment) and can have its own independent set of installed Python packages in its site directories.

Creation of virtual environments is done by executing the command venv:

Note

python3 -m venv /path/to/new/virtual/environment

Running this command creates the target directory (creating any parent directories that don’t exist already), a common name for the target directory uses as suffix _venv. It also creates a bin subdirectory containing a copy/symlink of the Python binary/binaries. It also creates an (initially empty) lib/pythonX.Y/site-packages subdirectory for the new packages. If an existing directory is specified, it will be re-used.

Once a virtual environment has been created, it can be “activated” using a script in the virtual environment’s binary directory. The invocation of the script is platform-specific (<_venv> must be replaced by the path of the directory containing the virtual environment):

Note

$ source /path/to/new/virtual/environment/bin/activate

Installing PM4NGS

Creates a Python virtual environment named: pm4ngs_venv for installing PM4NGS.

veraalva@instance-veraalva:~$ which python3
/usr/bin/python3
veraalva@instance-veraalva:~$ python3 -m venv pm4ngs_venv
veraalva@instance-veraalva:~$ source pm4ngs_venv/bin/activate
(pm4ngs_venv) veraalva@instance-veraalva:~$ which python3
/home/r78v10a07/pm4ngs_venv/bin/python3
(pm4ngs_venv) veraalva@instance-1:~$ pip install wheel
(pm4ngs_venv) veraalva@instance-1:~$ pip install pm4ngs

These commands need to be executed only one time during the tutorial.

Testing PM4NGS

Test PM4NGS RNA-Seq module:

(pm4ngs_venv) veraalva@instance-veraalva:~$ pm4ngs-rnaseq
Testing PM4NGS

The Jupyter Server

Jupyter is a free, open-source, interactive web tool known as a computational notebook, which researchers can use to combine software code, computational output, explanatory text and multimedia resources in a single document. Computational notebooks are essentially laboratory notebooks for scientific computing that keep track of all steps in a research project. Jupyter notebooks can be shared with collaborators and are used to guarantee reproducibility.

The Jupyter notebook has two components. Users input programming code or text in rectangular cells in a front-end web page. The browser then passes that code to a back-end ‘kernel’, which runs the code and returns the results. The kernel reside in the server, executing all code in that computer and the front-end in the user local browser.

Read this interesting article about Jupyter Notebooks for data scientists: https://www.nature.com/articles/d41586-018-07196-1

Accessing the instance with SSH

Click on the SSH button to login.

Instances

After clicking the SSH button, a new windows will be open. This is a Linux terminal running directly in the instance.

Linux terminal

Activating pm4ngs_venv virtual environment

Everytime we login into the instance, we need to activate the pm4ngs_venv virtual environment. This mean that we will be using the python packages installed in this virtual environment instead of the packages in the instance.

veraalva@instance-veraalva:~$ source pm4ngs_venv/bin/activate

Running the Jupyter Server

Open a terminal and activate the pm4ngs_venv virtual environment and run the jupyter server. As the GCP instance is a remote computer, we need to run the jupyter server with the --port and --ip options.

Creates a password for the Jupyter server:

(pm4ngs_venv) r78v10a07@instance-veraalva:~$ jupyter notebook password

Start the Jupyter server

(pm4ngs_venv) r78v10a07@instance-veraalva:~$ jupyter notebook --no-browser --port=8888 --ip=0.0.0.0
Start the jupyter notebook server

Open the jupyter web interface in your local computer

Get the External IP for your instance in the GCp Cloud console: VM instances. Then, type the address in your local browser plus the jupyter server port: :8888

Jupyter web interface

PM4NGS for RNA-Seq data analysis

Differential expression (DE) analysis allows the comparison of RNA expression levels between multiple conditions.

The differential gene expression and GO enrichment pipeline is comprised of five steps, shown in next Table. The first step involves downloading samples from the NCBI SRA database, if necessary, or executing the pre-processing quality control tools on local samples. Subsequently, sample trimming, alignment, and quantification processes are executed. Once all samples are processed, groups of differentially expressed genes are identified per condition, using DESeq and EdgeR. Over- and under-expressed genes are reported by each program, and the interception of their results is computed.

Finally, once differentially expressed genes are identified, a GO enrichment analysis is executed to provide key biological processes, molecular functions, and cellular components for identified genes.

DGA and Go enrichment, RNASeq pipeline

Step

Jupyter Notebook

Workflow CWL

Tool

Input

Output

CWL Tool

Sample Download and Quality Control

Pre-processing QC

download_quality_control.cwl

SRA-Tools

SRA accession

Fastq

fastq-dump.cwl

FastQC

Fastq

FastQC HTML and Zip

fastqc.cwl

Trimming

Samples trimming

Trimmomatic

Fastq

Fastq

trimmomatic-PE.cwl

trimmomatic-SE.cwl

Alignments and Quantification

Alignment and Quantification

rnaseq-alignment-quantification.cwl

STAR

Fastq

BAM

star.cwl

Samtools

SAM

BAM

Sorted BAM

BAM index

BAM stats

BAM flagstats

samtools-view.cwl

samtools-sort.cwl

samtools-index.cwl

samtools-stats.cwl

samtools-flagstat.cwl

IGVtools

Sorted BAM

TDF

igvtools-count.cw

RSeQC

Sorted BAM

Alignment quality

control

(TXT and PDF)

rseqc-bam_stat.cwl

rseqc-infer_experiment.cwl

rseqc-junction_annotation.cwl

rseqc-junction_saturation.cwl

rseqc-read_distribution.cwl

rseqc-read_quality.cwl

TPMCalculator

Sorted BAM

Read counts (TSV)

TPM values (TSV)

tpmcalculator.cwl

Differential Gene Analysis

Differential Gene Expression Analysis

DeSEq2

Read Matrix

TSV

deseq2-2conditions.cwl

EdgeR

Read Matrix

TSV

edgeR-2conditions.cwl

Go enrichment

Gene Ontology Enrichment

goenrichment

gene IDs

TSV

Python code in the notebook

Sample sheet

Sample Sheet for Drosophila BioProject PRJDB5673

sample_name

file

description

replicate

condition

DRR092341

Wild type, adult male head, replicate 1

1

WTAdult

DRR092342

Wild type, adult male head, replicate 2

2

WTAdult

DRR092343

Lol mutant, adult male head, replicate 1

1

LolAdult

DRR092344

Lol mutant, adult male head, replicate 2

2

LolAdult

Download the sample sheet file to your local computer: sample_sheet_adult.csv

Upload the sample sheet file to the GCP instance through the Jupyter server

_images/jupyter-2.png _images/jupyter-3.png

Creating the PM4NGS RNA-Seq project

Open a terminal to execute pm4ngs-rnaseq. This command will create an organizational data structure for the analysis.

_images/jupyter-4.png

In the terminal, activate the pm4ngs_venv virtual environment and execute the pm4ngs-rnaseq command.

The pm4ngs-rnaseq command line executed with the --sample-sheet option will let you type the different variables required for creating and configuring the project. The default value for each variable is shown in the brackets. After all questions are answered, the CWL workflow files will be cloned from the github repo ncbi/cwl-ngs-workflows-cbb to the folder bin/cwl.

r78v10a07@instance-veraalva:~$ source pm4ngs_venv/bin/activate
(pm4ngs_venv) r78v10a07@instance-veraalva:~$ ls
pm4ngs_venv  sample_sheet_adult.csv
(pm4ngs_venv) r78v10a07@instance-veraalva:~$ pm4ngs-rnaseq --sample-sheet sample_sheet_adult.csv

Note

  • author_name:

    Use your full name

  • email:

    Use your email

  • project_name:

    Name of the project with no space nor especial characters. This will be used as project folder's name.

    Use: Dros_lol_mut

  • dataset_name:

    Dataset to process name with no space nor especial characters. This will be used as folder name to group the data. This folder will be created under the data/{{dataset_name}} and results/{{dataset_name}}.

    Use: PRJDB5673

  • is_data_in_SRA:

    If the data is in the SRA set this to y. A CWL workflow to download the data from the SRA database to the folder data/{{dataset_name}} and execute FastQC on it will be included in the 01 - Pre-processing QC.ipynb notebook.

    Set this option to n, if the fastq files names and location are included in the sample sheet.

    Use: y

  • Select sequencing_technology:

    Select one of the available sequencing technologies in your data.

    Values: 1 - single-end, 2 - paired-end

    Use: 1

  • create_demo:

    If the data is downloaded from the SRA and this option is set to y, only the number of spots specified in the next variable will be downloaded. Useful to test the workflow.

    Use: n

  • number_spots:

    Number of sport to download from the SRA database. It is ignored is the create_demo is set to n.

    Press Enter

  • organism:

    Organism to process, e.g. human. This is used to link the selected genes to the NCBI gene database.

    Use: drosophila

  • genome_name:

    Genome name , e.g. hg38 or mm10.

    Use: dm6

  • genome_dir:

    Absolute path to the directory with the genome annotation (genome.fa, genes.gtf) to be used by the workflow or the name of the genome.

    If the name of the genome is used, PM4NGS will include a cell in the 03 - Alignments and Quantification.ipynb notebook to download the genome files. The genome data will be at data/{{dataset_name}}/{{genome_name}}/

    Press Enter

  • aligner_index_dir:

    Absolute path to the directory with the genome indexes for STAR.

    If {{genome_name}}/STAR is used, PM4NGS will include a cell in the 03 - Alignments and Quantification.ipynb notebook to create the genome indexes for STAR.

    Press Enter

  • genome_fasta:

    Absolute path to the genome fasta file

    If {{genome_name}}/genome.fa is used, PM4NGS will use the downloaded fasta file.

    Press Enter

  • genome_gtf:

    Absolute path to the genome GTF file

    If {{genome_name}}/genes.gtf is used, PM4NGS will use the downloaded GTF file.

    Press Enter

  • genome_bed:

    Absolute path to the genome BED file

    If {{genome_name}}/genes.bed is used, PM4NGS will use the downloaded BED file.

    Press Enter

  • fold_change:

    A real number used as fold change cutoff value for the DG analysis, e.g. 2.0.

    Press Enter

  • fdr:

    Adjusted P-Value to be used as cutoff in the DG analysis, e.g. 0.05.

    Press Enter

  • use_docker:

    Set this to y if you will be using Docker. Otherwise Conda needs to be installed in the computer.

    Press Enter

  • max_number_threads:

    Number of threads available in the computer.

    Press Enter

_images/jupyter-5.png

The project organizational data structure is:

_images/jupyter-6.png

Pre-processing QC

The first notebook download the SRA data using the accessions defined in the sample sheet. Execute all cells until the Retrieving data using fastq-dump. This cell will submit the CWL workflow. Open a terminal to check that the fastq-dump command is working.

_images/jupyter-7.png

The CWL workflow for this step is: download_quality_control.cwl

_images/cwl-1.png

Once all cells are execute completely the fastq samples will be available at the data/PRJDB5673 directory. Run the tree command to visualize the data structure.

(pm4ngs_venv) r78v10a07@instance-veraalva:~$ tree -L 3 Dros_lol_mut/
_images/jupyter-8.png

The pre-processing table is available in the 00 - Project Report notebook. The links in the table gives access to the FastQC reports for each sample. The reports are used to select the trimming parameters.

_images/jupyter-9.png _images/jupyter-10.png _images/jupyter-11.png _images/jupyter-12.png

Samples trimming

Trimmomatic performs a variety of useful trimming tasks for illumina paired-end and single ended data.The selection of trimming steps and their associated parameters are supplied on the command line.

The current trimming steps are:

  • ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read.

  • SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold.

  • LEADING: Cut bases off the start of a read, if below a threshold quality

  • TRAILING: Cut bases off the end of a read, if below a threshold quality

  • CROP: Cut the read to a specified length

  • HEADCROP: Cut the specified number of bases from the start of the read

  • MINLEN: Drop the read if it is below a specified length

  • TOPHRED33: Convert quality scores to Phred-33

  • TOPHRED64: Convert quality scores to Phred-64

It works with FASTQ (using phred + 33 or phred + 64 quality scores, depending on the Illumina pipeline used), either uncompressed or gzipp'ed FASTQ. Use of gzip format is determined based on the .gz extension.

PM4NGS uses standard options for Trimmomatic but in this example we need to add HEADCROP:10 as it is shown in next figure:

_images/jupyter-13.png

The CWL workflow for this step is: trimming-qc-se.cwl

_images/cwl-2.png

The result files for the trimming workflow are available at: results/PRJDB5673/trimmomatic

_images/terminal-5.png

Updating the 00 - Project Report notebook.

_images/jupyter-14.png

Check the FastQC reports to check if the trimming reduced the distortion in the first 10 bases.

_images/jupyter-15.png

Alignment and Quantification

The alignment and quantification workflow uses STAR as aligner, Samtools for filtering and sorting BAM files, TPMCalculator for RNA-Seq quantification, RSeQC for post processing quality control and IGVtools for creating visualization files.

In this tutorial we are analysing Drosophila samples. Therefore, we need the Drosophila genome sequence and annotations. PM4NGS provides the dm6 genome pre-formatted for the alignment and quantification workflow. For a complete list of PM4NGS pre-formatted genomes see: https://pm4ngs.readthedocs.io/en/latest/pipelines/genomes.html

_images/jupyter-16.png

This workflow is the most time consuming part and require setting proper computer resources like number of cores and RAM. For this tutorial we need at least 64 GB of RAM and 16 cores. GCP machine type n1-standard-16 provides those resources.

_images/jupyter-17.png

The CWL workflow for this step is: rnaseq-alignment-quantification.cwl

_images/cwl-3.png

It is based in another workflow for the alignment: star-alignment.cwl

_images/cwl-4.png

Once finished, the alignment and quantification workflow will create for each sample these files:

Note

STAR alignment stats

  • DRR092341Aligned.out.stats

  • DRR092341Log.final.out

  • DRR092341ReadsPerGene.out.tab

Alignments files sorted from SAMtools

  • DRR092341_sorted.bam

  • DRR092341_sorted.bam.bai

Read counts and TPMs from TPMCalculator

  • DRR092341_sorted_genes.ent.gz

  • DRR092341_sorted_genes.out.gz

  • DRR092341_sorted_genes.uni.gz

  • DRR092341_sorted_transcripts.ent.gz

  • DRR092341_sorted_transcripts.out.gz

** Post-processing QC from RSeQC**

  • DRR092341_sorted_rseqc.bam_stat.txt

  • DRR092341_sorted_rseqc.infer_experiment.txt

  • DRR092341_sorted_rseqc.junction.bed.gz

  • DRR092341_sorted_rseqc.junction.xls.gz

  • DRR092341_sorted_rseqc.junctionSaturation_plot.pdf

  • DRR092341_sorted_rseqc.read_distribution.txt

  • DRR092341_sorted_rseqc.splice_events.pdf

  • DRR092341_sorted_rseqc.splice_junction.pdf

_images/jupyter-18.png

Updating the 00 - Project Report notebook.

_images/jupyter-19.png _images/jupyter-20.png

The quantification values read counts and TPM are shown in a boxplot for easy comparison.

_images/jupyter-21.png _images/jupyter-22.png

Differential Gene Expression Analysis

The DGE analysis is executed in the 04 - DGA notebook. The workflow uses DESeq2 and EdgeR for the identification of differentially expressed genes. R scripts are available at deseq2-2conditions.cwl and edgeR-2conditions.cwl.

In our approach, genes are designated as differentially expressed if they are identified by DESeq2 and EdgeR. Those genes are shown under the Intersection results.

results/PRJDB5673/dga/
├── WTAdult_vs_LolAdult_deseq2_dga.log
├── WTAdult_vs_LolAdult_edge_dga.log
├── WTAdult_vs_LolAdult_heatmap_union.log
├── WTAdult_vs_LolAdult_volcano_union.log
├── condition_WTAdult_vs_LolAdult_deseq2.csv
├── condition_WTAdult_vs_LolAdult_deseq2_correlation_heatmap.pdf
├── condition_WTAdult_vs_LolAdult_deseq2_expression_heatmap.pdf
├── condition_WTAdult_vs_LolAdult_deseq2_pca.csv
├── condition_WTAdult_vs_LolAdult_deseq2_pca.pdf
├── condition_WTAdult_vs_LolAdult_deseq2_volcano.pdf
├── condition_WTAdult_vs_LolAdult_edgeR.csv
├── condition_WTAdult_vs_LolAdult_edgeR_correlation_heatmap.pdf
├── condition_WTAdult_vs_LolAdult_edgeR_expression_heatmap.pdf
├── condition_WTAdult_vs_LolAdult_edgeR_pca.csv
├── condition_WTAdult_vs_LolAdult_edgeR_pca.pdf
├── condition_WTAdult_vs_LolAdult_edgeR_volcano.pdf
├── condition_WTAdult_vs_LolAdult_intersection.csv
├── condition_WTAdult_vs_LolAdult_intersection_correlation_heatmap.pdf
├── condition_WTAdult_vs_LolAdult_intersection_expression_heatmap.pdf
├── condition_WTAdult_vs_LolAdult_intersection_over-expressed.csv
├── condition_WTAdult_vs_LolAdult_intersection_pca.pdf
├── condition_WTAdult_vs_LolAdult_intersection_under-expressed.csv
└── condition_WTAdult_vs_LolAdult_intersection_volcano.pdf

0 directories, 23 files

Updating the 00 - Project Report notebook.

_images/jupyter-23.png

DGE plots are automatically generated.

Volcano Plots

_images/jupyter-24.png

Sample correlation

_images/jupyter-25.png

Expression correlation

_images/jupyter-26.png

PCA plots

_images/jupyter-27.png

List of differentially expressed genes

_images/jupyter-28.png

Gene Ontology Enrichment

Gene Ontoogy enrichment analysis is executed with an in-house developed python package named goenrichment. This tool creates a database by integrating information from Gene Ontology, NCBI Gene using the files gene_info and gene2go. Read more in the package web page to create a database for another organism.

Available databases can be downloaded from https://ftp.ncbi.nlm.nih.gov/pub/goenrichment/

Next image shows the pre-defined code in PM4NGS for downloading the GO enrichment database for Drosophila. Current version of the database contains 43 987 GO terms cros-referenced to 13 741 genes.

_images/jupyter-29.png

The GO enrichment tool uses three parameters to identify differential GO terms between the two conditions.

Note

min_category_depth

Min GO term graph depth to include in the report. Default: 4

min_category_size

Min number of gene in a GO term to include in the report. Default: 3

max_category_size

Max number of gene in a GO term to include in the report. Default: 500 This 500 is good for very well annotated organism as human or mouse.

In the case of Drosophila use 10 to get relevant results.

In the Jupyter Notebook change the values to

_images/jupyter-30.png

Updating the 00 - Project Report notebook.

_images/jupyter-31.png

The list of differential GO terms are available in the tables per GO name space.

_images/jupyter-32.png _images/jupyter-33.png _images/jupyter-34.png

Introduction

This tutorial will show how to create and configure a Google Cloud Platform instance for Next Generation Sequencing (NGS) data analysis. PM4NGS is a program manager for NGS data analysis that is designed to generate a standard organizational structure including directory structures for the project, Jupyter notebooks for data management and CWL workflows for the pipeline execution.

Background reading

  1. Vera Alvarez R, Pongor LS, Mariño-Ramírez L and Landsman D. PM4NGS, a project management framework for next-generation sequencing data analysis, GigaScience, Volume 10, Issue 1, January 2021, giaa141, https://doi.org/10.1093/gigascience/giaa141

  2. Vera Alvarez R, Mariño-Ramírez L and Landsman D. Transcriptome annotation in the cloud: complexity, best practices, and cost, GigaScience, Volume 10, Issue 2, February 2021, giaa163, https://doi.org/10.1093/gigascience/giaa163

Help and Support

For feature requests or bug reports, please open an issue on our GitHub Repository.