PM4NGS RNASeq Tutorial¶
RNA-seq of the lobe-less (lol) mutants in Drosophila melanogaster¶
Lobe-less (lol) encodes a long noncoding RNA (lncRNA) required for the development of Drosophila mushroom body, a center of insect memories and learning. RNA-seq was carried out in the control (w[1118]) and lol mutant flies to identify genes regulated by lol lncRNA.
From the manuscript: Genetic interactions between Protein Kinase D and Lobe mutants during eye development of Drosophila melanogaster

BioProject¶
A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium. A BioProject record provides users a single place to find links to the diverse data types generated for that project.
BioProject: PRJDB5673

SRA¶
Sequence Read Archive (SRA) data, available through multiple cloud providers and NCBI servers, is the largest publicly available repository of high throughput sequencing data.
List of SRA samples in the BioProject

Click on the Send results to Run selector to see the SRA data in the Run selector app.

Select samples to be analyzed: DRR092341, DRR092342, DRR092343, DRR092344. Wild type, adult male head compared with Lol mutant, adult male head.

Click on Deliver Data to request NCBI to transfer the data to a cloud bucket.

GCP Instance Setup¶
Create a GCP instance¶
Create a GCP instance with 16 CPUs and 60GB of RAM (machine type: n1-standard-16). Click on Change button in the Boot disk box to select Operating system: Ubuntu, Version: Ubuntu 20.04 LTS. Click on Management to select Preemptibility: ON

Accessing the instance with SSH¶
The instance will be available with two IP address, one internal for GCP and another external accessible from Internet. Click on the SSH button to login.

After clicking the SSH button, a new windows will be open. This is a Linux terminal running directly in the instance.

Installing dependencies on the GCP instance with Ubuntu¶
Runs these commands on a terminal to prepare the instance to run PM4NGS
veraalva@instance-veraalva:~$ sudo apt-get update
veraalva@instance-veraalva:~$ sudo apt-get install docker.io python3 python3-pip python3-venv python3-dev poppler-utils gcc nodejs tree
veraalva@instance-veraalva:~$ sudo usermod -aG docker $USER
veraalva@instance-veraalva:~$ logout
Close and reopen the terminal to set the docker group in the user. Then, click on the SSH button again to re-launch the terminal.
Testing the Docker daemon¶
veraalva@instance-veraalva:~$ docker run hello-world
Docker will pull the hello-world image and run it in a container. A Hello from Docker! message is displayed in the terminal.

Installing PM4NGS¶
Read PM4NGS published manuscript in GIGAScience
What is a Python virtual environment?¶
A python virtual environment is a "new Python installation" with their own site directories, optionally isolated from system site directories. Each virtual environment has its own Python binary (which matches the version of the binary that was used to create this environment) and can have its own independent set of installed Python packages in its site directories.
Creation of virtual environments is done by executing the command venv:
Note
python3 -m venv /path/to/new/virtual/environment
Running this command creates the target directory (creating any parent directories that don’t exist already), a common name for the target directory uses as suffix _venv. It also creates a bin subdirectory containing a copy/symlink of the Python binary/binaries. It also creates an (initially empty) lib/pythonX.Y/site-packages subdirectory for the new packages. If an existing directory is specified, it will be re-used.
Once a virtual environment has been created, it can be “activated” using a script in the virtual environment’s binary directory. The invocation of the script is platform-specific (<_venv> must be replaced by the path of the directory containing the virtual environment):
Note
$ source /path/to/new/virtual/environment/bin/activate
Installing PM4NGS¶
Creates a Python virtual environment named: pm4ngs_venv for installing PM4NGS.
veraalva@instance-veraalva:~$ which python3
/usr/bin/python3
veraalva@instance-veraalva:~$ python3 -m venv pm4ngs_venv
veraalva@instance-veraalva:~$ source pm4ngs_venv/bin/activate
(pm4ngs_venv) veraalva@instance-veraalva:~$ which python3
/home/r78v10a07/pm4ngs_venv/bin/python3
(pm4ngs_venv) veraalva@instance-1:~$ pip install wheel
(pm4ngs_venv) veraalva@instance-1:~$ pip install pm4ngs
These commands need to be executed only one time during the tutorial.
Testing PM4NGS¶
Test PM4NGS RNA-Seq module:
(pm4ngs_venv) veraalva@instance-veraalva:~$ pm4ngs-rnaseq

The Jupyter Server¶
Jupyter is a free, open-source, interactive web tool known as a computational notebook, which researchers can use to combine software code, computational output, explanatory text and multimedia resources in a single document. Computational notebooks are essentially laboratory notebooks for scientific computing that keep track of all steps in a research project. Jupyter notebooks can be shared with collaborators and are used to guarantee reproducibility.
The Jupyter notebook has two components. Users input programming code or text in rectangular cells in a front-end web page. The browser then passes that code to a back-end ‘kernel’, which runs the code and returns the results. The kernel reside in the server, executing all code in that computer and the front-end in the user local browser.
Read this interesting article about Jupyter Notebooks for data scientists: https://www.nature.com/articles/d41586-018-07196-1
Accessing the instance with SSH¶
Click on the SSH button to login.

After clicking the SSH button, a new windows will be open. This is a Linux terminal running directly in the instance.

Activating pm4ngs_venv virtual environment¶
Everytime we login into the instance, we need to activate the pm4ngs_venv virtual environment. This mean that we will be using the python packages installed in this virtual environment instead of the packages in the instance.
veraalva@instance-veraalva:~$ source pm4ngs_venv/bin/activate
Running the Jupyter Server¶
Open a terminal and activate the pm4ngs_venv virtual environment and run the jupyter server. As the GCP instance is a remote computer, we need to run the jupyter server with the --port and --ip options.
Creates a password for the Jupyter server:
(pm4ngs_venv) r78v10a07@instance-veraalva:~$ jupyter notebook password
Start the Jupyter server
(pm4ngs_venv) r78v10a07@instance-veraalva:~$ jupyter notebook --no-browser --port=8888 --ip=0.0.0.0

Open the jupyter web interface in your local computer¶
Get the External IP for your instance in the GCp Cloud console: VM instances. Then, type the address in your local browser plus the jupyter server port: :8888

PM4NGS for RNA-Seq data analysis¶
Differential expression (DE) analysis allows the comparison of RNA expression levels between multiple conditions.
The differential gene expression and GO enrichment pipeline is comprised of five steps, shown in next Table. The first step involves downloading samples from the NCBI SRA database, if necessary, or executing the pre-processing quality control tools on local samples. Subsequently, sample trimming, alignment, and quantification processes are executed. Once all samples are processed, groups of differentially expressed genes are identified per condition, using DESeq and EdgeR. Over- and under-expressed genes are reported by each program, and the interception of their results is computed.
Finally, once differentially expressed genes are identified, a GO enrichment analysis is executed to provide key biological processes, molecular functions, and cellular components for identified genes.
Step |
Jupyter Notebook |
Workflow CWL |
Tool |
Input |
Output |
CWL Tool |
---|---|---|---|---|---|---|
Sample Download and Quality Control |
SRA-Tools |
SRA accession |
Fastq |
|||
FastQC |
Fastq |
FastQC HTML and Zip |
||||
Trimming |
Trimmomatic |
Fastq |
Fastq |
|||
Alignments and Quantification |
STAR |
Fastq |
BAM |
|||
Samtools |
SAM |
BAM Sorted BAM BAM index BAM stats BAM flagstats |
||||
IGVtools |
Sorted BAM |
TDF |
||||
RSeQC |
Sorted BAM |
Alignment quality control (TXT and PDF) |
||||
TPMCalculator |
Sorted BAM |
Read counts (TSV) TPM values (TSV) |
||||
Differential Gene Analysis |
DeSEq2 |
Read Matrix |
TSV |
|||
EdgeR |
Read Matrix |
TSV |
||||
Go enrichment |
gene IDs |
TSV |
Python code in the notebook |
Sample sheet¶
Sample Sheet for Drosophila BioProject PRJDB5673
sample_name |
file |
description |
replicate |
condition |
---|---|---|---|---|
DRR092341 |
Wild type, adult male head, replicate 1 |
1 |
WTAdult |
|
DRR092342 |
Wild type, adult male head, replicate 2 |
2 |
WTAdult |
|
DRR092343 |
Lol mutant, adult male head, replicate 1 |
1 |
LolAdult |
|
DRR092344 |
Lol mutant, adult male head, replicate 2 |
2 |
LolAdult |
Download the sample sheet file to your local computer: sample_sheet_adult.csv
Upload the sample sheet file to the GCP instance through the Jupyter server


Creating the PM4NGS RNA-Seq project¶
Open a terminal to execute pm4ngs-rnaseq. This command will create an organizational data structure for the analysis.

In the terminal, activate the pm4ngs_venv virtual environment and execute the pm4ngs-rnaseq command.
The pm4ngs-rnaseq command line executed with the --sample-sheet option will let you type the different variables required for creating and configuring the project. The default value for each variable is shown in the brackets. After all questions are answered, the CWL workflow files will be cloned from the github repo ncbi/cwl-ngs-workflows-cbb to the folder bin/cwl.
r78v10a07@instance-veraalva:~$ source pm4ngs_venv/bin/activate
(pm4ngs_venv) r78v10a07@instance-veraalva:~$ ls
pm4ngs_venv sample_sheet_adult.csv
(pm4ngs_venv) r78v10a07@instance-veraalva:~$ pm4ngs-rnaseq --sample-sheet sample_sheet_adult.csv
Note
- author_name:
Use your full name
- email:
Use your email
- project_name:
Name of the project with no space nor especial characters. This will be used as project folder's name.
Use: Dros_lol_mut
- dataset_name:
Dataset to process name with no space nor especial characters. This will be used as folder name to group the data. This folder will be created under the data/{{dataset_name}} and results/{{dataset_name}}.
Use: PRJDB5673
- is_data_in_SRA:
If the data is in the SRA set this to y. A CWL workflow to download the data from the SRA database to the folder data/{{dataset_name}} and execute FastQC on it will be included in the 01 - Pre-processing QC.ipynb notebook.
Set this option to n, if the fastq files names and location are included in the sample sheet.
Use: y
- Select sequencing_technology:
Select one of the available sequencing technologies in your data.
Values: 1 - single-end, 2 - paired-end
Use: 1
- create_demo:
If the data is downloaded from the SRA and this option is set to y, only the number of spots specified in the next variable will be downloaded. Useful to test the workflow.
Use: n
- number_spots:
Number of sport to download from the SRA database. It is ignored is the create_demo is set to n.
Press Enter
- organism:
Organism to process, e.g. human. This is used to link the selected genes to the NCBI gene database.
Use: drosophila
- genome_name:
Genome name , e.g. hg38 or mm10.
Use: dm6
- genome_dir:
Absolute path to the directory with the genome annotation (genome.fa, genes.gtf) to be used by the workflow or the name of the genome.
If the name of the genome is used, PM4NGS will include a cell in the 03 - Alignments and Quantification.ipynb notebook to download the genome files. The genome data will be at data/{{dataset_name}}/{{genome_name}}/
Press Enter
- aligner_index_dir:
Absolute path to the directory with the genome indexes for STAR.
If {{genome_name}}/STAR is used, PM4NGS will include a cell in the 03 - Alignments and Quantification.ipynb notebook to create the genome indexes for STAR.
Press Enter
- genome_fasta:
Absolute path to the genome fasta file
If {{genome_name}}/genome.fa is used, PM4NGS will use the downloaded fasta file.
Press Enter
- genome_gtf:
Absolute path to the genome GTF file
If {{genome_name}}/genes.gtf is used, PM4NGS will use the downloaded GTF file.
Press Enter
- genome_bed:
Absolute path to the genome BED file
If {{genome_name}}/genes.bed is used, PM4NGS will use the downloaded BED file.
Press Enter
- fold_change:
A real number used as fold change cutoff value for the DG analysis, e.g. 2.0.
Press Enter
- fdr:
Adjusted P-Value to be used as cutoff in the DG analysis, e.g. 0.05.
Press Enter
- use_docker:
Set this to y if you will be using Docker. Otherwise Conda needs to be installed in the computer.
Press Enter
- max_number_threads:
Number of threads available in the computer.
Press Enter

The project organizational data structure is:

Pre-processing QC¶
The first notebook download the SRA data using the accessions defined in the sample sheet. Execute all cells until the Retrieving data using fastq-dump. This cell will submit the CWL workflow. Open a terminal to check that the fastq-dump command is working.

The CWL workflow for this step is: download_quality_control.cwl

Once all cells are execute completely the fastq samples will be available at the data/PRJDB5673 directory. Run the tree command to visualize the data structure.
(pm4ngs_venv) r78v10a07@instance-veraalva:~$ tree -L 3 Dros_lol_mut/

The pre-processing table is available in the 00 - Project Report notebook. The links in the table gives access to the FastQC reports for each sample. The reports are used to select the trimming parameters.




Samples trimming¶
Trimmomatic performs a variety of useful trimming tasks for illumina paired-end and single ended data.The selection of trimming steps and their associated parameters are supplied on the command line.
The current trimming steps are:
ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read.
SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold.
LEADING: Cut bases off the start of a read, if below a threshold quality
TRAILING: Cut bases off the end of a read, if below a threshold quality
CROP: Cut the read to a specified length
HEADCROP: Cut the specified number of bases from the start of the read
MINLEN: Drop the read if it is below a specified length
TOPHRED33: Convert quality scores to Phred-33
TOPHRED64: Convert quality scores to Phred-64
It works with FASTQ (using phred + 33 or phred + 64 quality scores, depending on the Illumina pipeline used), either uncompressed or gzipp'ed FASTQ. Use of gzip format is determined based on the .gz extension.
PM4NGS uses standard options for Trimmomatic but in this example we need to add HEADCROP:10 as it is shown in next figure:

The CWL workflow for this step is: trimming-qc-se.cwl

The result files for the trimming workflow are available at: results/PRJDB5673/trimmomatic

Updating the 00 - Project Report notebook.

Check the FastQC reports to check if the trimming reduced the distortion in the first 10 bases.

Alignment and Quantification¶
The alignment and quantification workflow uses STAR as aligner, Samtools for filtering and sorting BAM files, TPMCalculator for RNA-Seq quantification, RSeQC for post processing quality control and IGVtools for creating visualization files.
In this tutorial we are analysing Drosophila samples. Therefore, we need the Drosophila genome sequence and annotations. PM4NGS provides the dm6 genome pre-formatted for the alignment and quantification workflow. For a complete list of PM4NGS pre-formatted genomes see: https://pm4ngs.readthedocs.io/en/latest/pipelines/genomes.html

This workflow is the most time consuming part and require setting proper computer resources like number of cores and RAM. For this tutorial we need at least 64 GB of RAM and 16 cores. GCP machine type n1-standard-16 provides those resources.

The CWL workflow for this step is: rnaseq-alignment-quantification.cwl

It is based in another workflow for the alignment: star-alignment.cwl

Once finished, the alignment and quantification workflow will create for each sample these files:
Note
STAR alignment stats
DRR092341Aligned.out.stats
DRR092341Log.final.out
DRR092341ReadsPerGene.out.tab
Alignments files sorted from SAMtools
DRR092341_sorted.bam
DRR092341_sorted.bam.bai
Read counts and TPMs from TPMCalculator
DRR092341_sorted_genes.ent.gz
DRR092341_sorted_genes.out.gz
DRR092341_sorted_genes.uni.gz
DRR092341_sorted_transcripts.ent.gz
DRR092341_sorted_transcripts.out.gz
** Post-processing QC from RSeQC**
DRR092341_sorted_rseqc.bam_stat.txt
DRR092341_sorted_rseqc.infer_experiment.txt
DRR092341_sorted_rseqc.junction.bed.gz
DRR092341_sorted_rseqc.junction.xls.gz
DRR092341_sorted_rseqc.junctionSaturation_plot.pdf
DRR092341_sorted_rseqc.read_distribution.txt
DRR092341_sorted_rseqc.splice_events.pdf
DRR092341_sorted_rseqc.splice_junction.pdf

Updating the 00 - Project Report notebook.


The quantification values read counts and TPM are shown in a boxplot for easy comparison.


Differential Gene Expression Analysis¶
The DGE analysis is executed in the 04 - DGA notebook. The workflow uses DESeq2 and EdgeR for the identification of differentially expressed genes. R scripts are available at deseq2-2conditions.cwl and edgeR-2conditions.cwl.
In our approach, genes are designated as differentially expressed if they are identified by DESeq2 and EdgeR. Those genes are shown under the Intersection results.
results/PRJDB5673/dga/
├── WTAdult_vs_LolAdult_deseq2_dga.log
├── WTAdult_vs_LolAdult_edge_dga.log
├── WTAdult_vs_LolAdult_heatmap_union.log
├── WTAdult_vs_LolAdult_volcano_union.log
├── condition_WTAdult_vs_LolAdult_deseq2.csv
├── condition_WTAdult_vs_LolAdult_deseq2_correlation_heatmap.pdf
├── condition_WTAdult_vs_LolAdult_deseq2_expression_heatmap.pdf
├── condition_WTAdult_vs_LolAdult_deseq2_pca.csv
├── condition_WTAdult_vs_LolAdult_deseq2_pca.pdf
├── condition_WTAdult_vs_LolAdult_deseq2_volcano.pdf
├── condition_WTAdult_vs_LolAdult_edgeR.csv
├── condition_WTAdult_vs_LolAdult_edgeR_correlation_heatmap.pdf
├── condition_WTAdult_vs_LolAdult_edgeR_expression_heatmap.pdf
├── condition_WTAdult_vs_LolAdult_edgeR_pca.csv
├── condition_WTAdult_vs_LolAdult_edgeR_pca.pdf
├── condition_WTAdult_vs_LolAdult_edgeR_volcano.pdf
├── condition_WTAdult_vs_LolAdult_intersection.csv
├── condition_WTAdult_vs_LolAdult_intersection_correlation_heatmap.pdf
├── condition_WTAdult_vs_LolAdult_intersection_expression_heatmap.pdf
├── condition_WTAdult_vs_LolAdult_intersection_over-expressed.csv
├── condition_WTAdult_vs_LolAdult_intersection_pca.pdf
├── condition_WTAdult_vs_LolAdult_intersection_under-expressed.csv
└── condition_WTAdult_vs_LolAdult_intersection_volcano.pdf
0 directories, 23 files
Updating the 00 - Project Report notebook.

DGE plots are automatically generated.
Volcano Plots¶

Sample correlation¶

Expression correlation¶

PCA plots¶

List of differentially expressed genes¶

Gene Ontology Enrichment¶
Gene Ontoogy enrichment analysis is executed with an in-house developed python package named goenrichment. This tool creates a database by integrating information from Gene Ontology, NCBI Gene using the files gene_info and gene2go. Read more in the package web page to create a database for another organism.
Available databases can be downloaded from https://ftp.ncbi.nlm.nih.gov/pub/goenrichment/
Next image shows the pre-defined code in PM4NGS for downloading the GO enrichment database for Drosophila. Current version of the database contains 43 987 GO terms cros-referenced to 13 741 genes.

The GO enrichment tool uses three parameters to identify differential GO terms between the two conditions.
Note
min_category_depth
Min GO term graph depth to include in the report. Default: 4
min_category_size
Min number of gene in a GO term to include in the report. Default: 3
max_category_size
Max number of gene in a GO term to include in the report. Default: 500 This 500 is good for very well annotated organism as human or mouse.
In the case of Drosophila use 10 to get relevant results.
In the Jupyter Notebook change the values to

Updating the 00 - Project Report notebook.

The list of differential GO terms are available in the tables per GO name space.



Introduction¶
This tutorial will show how to create and configure a Google Cloud Platform instance for Next Generation Sequencing (NGS) data analysis. PM4NGS is a program manager for NGS data analysis that is designed to generate a standard organizational structure including directory structures for the project, Jupyter notebooks for data management and CWL workflows for the pipeline execution.
Background reading¶
Vera Alvarez R, Pongor LS, Mariño-Ramírez L and Landsman D. PM4NGS, a project management framework for next-generation sequencing data analysis, GigaScience, Volume 10, Issue 1, January 2021, giaa141, https://doi.org/10.1093/gigascience/giaa141
Vera Alvarez R, Mariño-Ramírez L and Landsman D. Transcriptome annotation in the cloud: complexity, best practices, and cost, GigaScience, Volume 10, Issue 2, February 2021, giaa163, https://doi.org/10.1093/gigascience/giaa163
Help and Support¶
For feature requests or bug reports, please open an issue on our GitHub Repository.