Practical training – RNA-seq data analysis

1. Introduction
2. Connect to the server and set up your working environment
3. Sequence Quality Controls
4. Mapping of Reads on the Reference Genome
5. Alignments Visualization with a Genome Browser
6. Search for Differentially Expressed Genes
7. Usefull commands to work on the cluster

1. Introdution

Objective of this practical session:

During this practical session, you will learn:

To conduct an analysis of RNA-seq raw fastq files. You will first perform quality control of the sequencing file before proceding to the mapping of the reads on the reference transcriptome.
To perform statistical analysis of the gene expression matrices in order to identify differentialy expressed genes between two conditions

Dataset used :

Data used in these practical were collected from the following publication:

Guida, A., Lindstädt, C., Maguire, S. L., Ding, C., Higgins, D. G., Corton, N. J., Berriman, M., et al. (2011). Using RNA-seq to determine the transcriptional landscape and the hypoxic response of the pathogenic yeast Candida parapsilosis. BMC genomics Guida et al. BMC Genomics 2011

2. Connect to the server and set up your working environment

For this part of the practical, you will work remotely on the IFB-core cluster located in Orsay (IDRIS). Because it it a shared infrastructure you will have to pay attention to the command you use to start a job. They all must start with srun.
You can refer to IFB Core Cluster Documentation for more detail about logging, data management, job submission.

On Windows using MobaXterm

In Session > SSH

Remote Host : core.cluster.france-bioinformatique.fr

Specify username : ticked and filled in your login

Advanced SSH settings : tick X11-Forwarding

Enter your passwork

You should be logged on the IFB Core cluster

On MacOS and Linux

ssh <login>@core.cluster.france-bioinformatique.fr

2 - Set up your working environment

Go to your home directory
```
 cd 
```
Create a directory that will contain all results of the upcoming analyses.
```
 mkdir RNAseq
```
Go to the newly created directory
```
 cd RNAseq
```
Check your are in the right directory using pwd :
```
 pwd
 /shared/home/<your login>/RNAseq
```

3. Sequence Quality Controls

Fastq files are raw results from RNA-seq experiments. These files comprise all the sequences (or reads) obtained with the sequencer machine (Illumina technology here), together with base quality scores (PHRED scores).

Two different files will be analyzed in this practical (see Guida et al. for more information) :

O2rep2_SRR352263.fastq refereed to a transcriptome analysis of yeasts C. parapsilosis under normoxic condition.
noO2rep3_SRR352271.fastq refereed to a transcriptome analysis of yeasts C. parapsilosis under hypoxic condition

In a first step, quality controls will be perform on each FASTQ files in order to evaluate the quality of the sequences and identify potential problems that could impact the following analyses. Dedicated JAVA software will be used : FastQC 0.11.8 . Note that other software exists.

FastQC is a quality control application for high throughput sequence data.
It provides an interactive application to review the results of several different quality control checks, and create an HTML based report.
The main functions of FastQC are:

Import of data from BAM, SAM or FastQ files (any variant)

Providing a quick overview to tell you in which areas there may be problems

Summary graphs and tables to quickly assess your data

Export of results to an HTML based permanent report

Offline operation to allow automated generation of reports without running the interactive application

Create a new directory to store the output of fastqc
```
 mkdir 1-QualityControl
```
Using the tree command, your directory should look like this :
```
 /shared/home/<your login>/RNAseq
 │
 └───1-QualityControl
```
Go to this directory
```
 cd 1-QualityControl
```
Get Fastqc available in your environment
```
 module load fastqc/0.11.8
```
Check the help page of the programme to see its usage and parameters
```
 srun fastqc --help
```

Run fastqc on each experiment files

/shared/projects/ens_hts_2020/data/rnaseq/O2rep2_SRR352263.fastq is the absolute path to the first file

-o creates all output files in the specified output directory. ‘.’ means current directory.

 # O2 condition reads
 srun fastqc /shared/projects/ens_hts_2020/data/rnaseq/O2rep2_SRR352263.fastq -o .

 # noO2 condition reads
 srun fastqc /shared/projects/ens_hts_2020/data/rnaseq/noO2rep3_SRR352271.fastq -o .

At this point you should see the two new files in your directory using the tree command

 /shared/home/<your login>/RNAseq
 │
 └───1-QualityControl
     └─── O2rep2_SRR352263.fastqc.html
     └─── O2rep2_SRR352263.fastqc.zip
     └─── noO2rep3_SRR352271.fastqc.html
     └─── noO2rep3_SRR352271.fastqc.zip

Download the HTML file reports on your local machine

 ### OPEN A NEW TERMINAL
 ## Create a directory where to put generated files on your computer
 mkdir ~/Desktop/RNAseq/

 ## Go to the location on your computer, where you want to put the data, for example:
 cd ~/Desktop/RNAseq/

 ## Download html report files
 scp <login>@core.cluster.france-bioinformatique.fr:~/RNAseq/1-QualityControl/*.html .
 # Enter your password 

Open the .html report with your favorite browser

4. Mapping of Reads on the Reference Genome

Once data quality is verified, reads will be mapped onto the reference genome of yeast C. parapsilosis. The genome sequence of C. parapsilosis and its annotations (locations of ORFs) were retrieved from the CGD database.

Different aligner and algorithms for RNA-seq analysis exist. We will use Bowtie 1.2.2 an ultrafast (memory-efficient) short read aligner. As an input, Bowtie uses a Fastq file (with reads to be aligned) and “pre-built indexes” of the reference genome. These indexes are named “C_parapsilosis.1.ebwt”, “C_parapsilosis.2.ebwt”, etc. They will allow boosting the alignment process.
As an output, Bowtie provides a SAM file. SAM (Sequence Alignment/Map) is a generic format for storing large nucleotide sequence alignments.

TO DO: Run sequence alignments with Bowtie using the two FASTQ files. Take a look at Bowtie documentation and describe the specified options (-m in particular). What is the proportion of reads aligned on the reference genome?

Create a new directory to store the output of fastqc

 #Go to the parental directory "RNAseq"
 cd ../

 #Create a new directory to store results of the alignment
 mkdir 2-Mapping

Your directory should now look like this :

 /shared/home/<your login>/RNAseq
 │
 └───1-QualityControl
     └─── O2rep2_SRR352263.fastqc.html
     └─── O2rep2_SRR352263.fastqc.zip
     └─── noO2rep3_SRR352271.fastqc.html
     └─── noO2rep3_SRR352271.fastqc.zip
 └─── 2-Mapping

Go to this directory
```
 cd 2-Mapping
```
Load Bowtie into your environment
```
 module load bowtie/1.2.2
```

Map the reads to the reference genome

-S will output the result in SAM format

/shared/projects/ens_hts_2020/data/rnaseq/bowtie_indexes/C_parapsilosis specify the location and the prefix (C_parapsilosis) of the bowtie’s index files

/shared/projects/ens_hts_2020/data/rnaseq/Fastqc/O2rep2_SRR352263.fastq.gz location of the input fastq

2> will print some statistic about the aligment (#of reads mapped, etc…)

> redirects the mapping output into a .sam file

 # Map the aerobic condition reads
 srun bowtie -S /shared/projects/ens_hts_2020/data/rnaseq/bowtie_indexes/C_parapsilosis \
     /shared/projects/ens_hts_2020/data/rnaseq/O2rep2_SRR352263.fastq 2> O2rep2_SRR352263_bowtie_mapping.out > O2rep2_SRR352263_bowtie_mapping.sam

 # Map the hypoxic condition reads
 srun bowtie -S /shared/projects/ens_hts_2020/data/rnaseq/bowtie_indexes/C_parapsilosis \
     /shared/projects/ens_hts_2020/data/rnaseq/noO2rep3_SRR352271.fastq 2> noO2rep3_SRR352271_bowtie_mapping.out > noO2rep3_SRR352271_bowtie_mapping.sam

Your directory should now look like this :

/shared/home/<your login>/RNAseq
│
└───1-QualityControl
	└─── O2rep2_SRR352263.fastqc.html
	└─── O2rep2_SRR352263.fastqc.zip
	└─── noO2rep3_SRR352271.fastqc.html
	└─── noO2rep3_SRR352271.fastqc.zip
└─── 2-Mapping
	└─── O2rep2_SRR352263_bowtie_mapping.sam
	└─── O2rep2_SRR352263_bowtie_mapping.out
	└─── noO2rep3_SRR352271_bowtie_mapping.sam
	└─── noO2rep3_SRR352271_bowtie_mapping.out

5. Alignments Visualization with a Genome Browser

The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based, next-generation sequence data and genomic annotations. In this practical, we will use IGV to visualize mapping results (see previous section). For that, SAM files has to be converted into BAM files (a binary version of SAM) and “sorted” according to the genomic sequence. We will use programs available in the Samtools 1.9 suite.

TO DO: Run the conversion of SAM files (obtain with BOWTIE) into BAM files and “sorted and indexed” BAM files. Run IGV.

Sort and Converte .sam into .bam files

samtools sort Sort alignments by genomic coordinates

| pipe the output of samtools sort to the next programme i.e. samtools view

samtools view will convert sam into bam

-b specify the output to be in BAM format

> write the output in the bam file

 module load samtools/1.9

 # Sort and convert O2 condition
 srun samtools sort O2rep2_SRR352263_bowtie_mapping.sam | srun samtools view -b  > O2rep2_SRR352263_bowtie_sorted.bam

 # Sort and convert noO2 condition
 srun samtools sort noO2rep3_SRR352271_bowtie_mapping.sam | srun samtools view -b  > noO2rep3_SRR352271_bowtie_sorted.bam

Create an index of the tow bam files

IGV requieres the to have an index of the bam file. The index of a bam file is name .bam.bai

 #Index the O2 condition
 srun samtools index O2rep2_SRR352263_bowtie_sorted.bam

 #Index the noO2 condition
 srun samtools index noO2rep3_SRR352271_bowtie_sorted.bam

Download the resulting bam and index files on your computer

 # To download the bam files from the cluster to your current directory (on your own computer), **open a new shell and run**

 scp  <your login>@core.cluster.france-bioinformatique.fr:~/RNAseq/2-Mapping/*.bam* .

You will also need the reference genome sequence and gene annotation files

 # To download the bam files from the cluster to your current directory (on your own computer), **open a new shell and run**

 scp  <your login>@core.cluster.france-bioinformatique.fr:/shared/projects/ens_hts_2020/data/rnaseq/C_parapsilosis_CGD.fasta .

 scp  <your login>@core.cluster.france-bioinformatique.fr:/shared/projects/ens_hts_2020/data/rnaseq/C_parapsilosis_ORFs.gff .

Visualize mapping results with IGV

Download IGV and follow the instalation instructions according to your OS

Once the IGV program is launched, it is necessary to import the reference genome “Genomes/Create .genome File…” (see below).

Select the FASTA file with the genomic sequence of C. parapsilosis “C_parapsilosis_CGD.fasta” (“Browse / FASTA file”) and enter information regarding ORFs positions, GFF file “C_parapsilosis_ORFs.gff” (“Browse Gene file”).

Finally, give a name to your genome (“Unique identifier” and “Descriptive name”) and press “OK”. Save the genome file in your home.

Warning! In order to IGV to create a index of your genome, you need to copy the reference genome FASTA file in writable directory.

C. parapsilosis genome is now loaded into IGV and can be selected from the top/left menu (see 1 below). The genomic sequence can be therefore explored, choosing for instance, a particular chromosome (see 2 below) or a genomic region (see 3). Note that gene annotations (ORF positions) are shown on the bottom of the window (see 4, blue lines) and you can obtain a more detailed view of the sequence using the cursor located on the top/right of the window, see 5).

Mapping results (“.sorted.bam” files) can now be imported (“File / Load from File”):

TO DO: Compare your results with those presented in the original publication: Did the authors use stranded-specific protocols? Can you observe differences between hypoxic and normoxic conditions?

6. Search for Differentially Expressed Genes

Gene counts estimation

To identify genes whose expression is different between hypoxic and normoxic conditions, we will count and compare the number of reads mapped to each ORF. A program available in the Bedtools 2.27.1 suite will be used.

TO DO: Calculate for each ORF the number of reads that were aligned (normoxic and hypoxic conditions).

Create a new directory to store the ORF count matrix

 #Go to the parental directory "RNAseq"
 cd ../

 #Create a new directory to store results of the alignment
 mkdir 3-ORF_reads_count

Your directory should now look like this:

 /shared/home/<your login>/RNAseq
 ├── 1-QualityControl
 │   ├── noO2rep3_SRR352271_fastqc.html
 │   ├── noO2rep3_SRR352271_fastqc.zip
 │   ├── O2rep2_SRR352263_fastqc.html
 │   └── O2rep2_SRR352263_fastqc.zip
 ├── 2-Mapping
 │   ├── noO2rep3_SRR352271_bowtie_mapping.out
 │   ├── noO2rep3_SRR352271_bowtie_mapping.sam
 │   ├── noO2rep3_SRR352271_bowtie_sorted.bam
 │   ├── noO2rep3_SRR352271_bowtie_sorted.bam.bai
 │   ├── O2rep2_SRR352263_bowtie_mapping.out
 │   ├── O2rep2_SRR352263_bowtie_mapping.sam
 │   ├── O2rep2_SRR352263_bowtie_sorted.bam
 │   └── O2rep2_SRR352263_bowtie_sorted.bam.bai
 └── 3-ORF_reads_count

Go to this directory
```
 cd  3-ORF_reads_count
```

Calculate for each ORF the number of reads that were aligned to it

 module load bedtools/2.27.1

 srun bedtools multicov -bams ../2-Mapping/O2rep2_SRR352263_bowtie_sorted.bam \
 -bed /shared/projects/ens_hts_2020/data/rnaseq/C_parapsilosis_ORFs.gff > O2rep2_SRR352263_gene_counts.gff

 srun sed 's/^.*ID=//' O2rep2_SRR352263_gene_counts.gff > O2rep2_SRR352263_gene_counts.tab

 srun bedtools multicov -bams ../2-Mapping/noO2rep3_SRR352271_bowtie_sorted.bam \
 -bed /shared/projects/ens_hts_2020/data/rnaseq/C_parapsilosis_ORFs.gff > noO2rep3_SRR352271_gene_counts.gff

 srun sed 's/^.*ID=//' noO2rep3_SRR352271_gene_counts.gff > noO2rep3_SRR352271_gene_counts.tab

Take a look at the product files to see how the count files are organised

At the end of RNA-seq data analysis your directory should like this :

 ├── 1-QualityControl
 │   ├── noO2rep3_SRR352271_fastqc.html
 │   ├── noO2rep3_SRR352271_fastqc.zip
 │   ├── O2rep2_SRR352263_fastqc.html
 │   └── O2rep2_SRR352263_fastqc.zip
 ├── 2-Mapping
 │   ├── noO2rep3_SRR352271_bowtie_mapping.out
 │   ├── noO2rep3_SRR352271_bowtie_mapping.sam
 │   ├── noO2rep3_SRR352271_bowtie_sorted.bam
 │   ├── noO2rep3_SRR352271_bowtie_sorted.bam.bai
 │   ├── O2rep2_SRR352263_bowtie_mapping.out
 │   ├── O2rep2_SRR352263_bowtie_mapping.sam
 │   ├── O2rep2_SRR352263_bowtie_sorted.bam
 │   └── O2rep2_SRR352263_bowtie_sorted.bam.bai
 └── 3-ORF_reads_count
     ├── noO2rep3_SRR352271_gene_counts.gff
     ├── noO2rep3_SRR352271_gene_counts.tab
     ├── O2rep2_SRR352263_gene_counts.gff
     └── O2rep2_SRR352263_gene_counts.tab

Unload the tools you used

 module unload samtools/1.9 bowtie/1.2.2 bedtools/2.27.1

Statistical analysis using DEseq2 R package

In their article (Guida et al., 2011), the authors repeated the experiment 6 times for normoxic condition (with O2) and 4 times for hypoxic conditions (without O2). Results obtained for all experiments are combined in the file “/shared/projects/2020_eu_HTSdataAnalysis/rnaseq/count_data_diffAnalysis.txt”. This file will be used to search for differentially expressed genes using the DESeq2 (Love et al. 2014) R package. The DESeq package provides methods to test for differential expression by use of the negative binonial distribution and a shrinkage estimator for the distribution’s variance.

TO DO: Search for differentially expressed genes using DESeq R package. How many genes are selected with different p-value thresholds (5%, 1%, etc.)? Check your results with IGV and use GOtermFinder (see practical on microarrays) to analyse the function of the selected genes.

Connect to Rstudio serveur of the IFB

In a web browser, connect to https://rstudio.cluster.france-bioinformatique.fr/auth-sign-in and log in using your user name and password (same as for ssh connection)

You will reached the familiar Rstudio environment :
Save the working notebook containing all the code in your personal environment
- In File > Open File… enter the path /shared/projects/ens_hts_2020/data/rnaseq/DEseq2.Rmd to open the notebook containing all the code needed for the practical.
- Save it into your personal folder using File > Save As
Using your notebook, follow the online tutorial to conduct the DE analysis

Usefull commands to work on the cluster

These are some usefull command lines you may need to work on the cluster

You can find more detailed information and video tutorial here IFB Core Cluster Documentation

How to download/upload your data from the cluster:

 # To download file/files from the cluster to your current directory
 scp  <your login>@core.cluster.france-bioinformatique.fr:/<absolute path to your file> 

 # To download a folder from the cluster to your current directory
  scp -r  <your login>@core.cluster.france-bioinformatique.fr:/<absolute path to your folder>

 # To upload a file to the cluster
  scp <path to your local folder> <your login>@core.cluster.france-bioinformatique.fr:/<absolute path to the target folder>

To have information on your current job:
```
 squeue -u <your login>
```

To list all your running/pending jobs:

 squeue -u <your login> -t RUNNING

 squeue -u <your login> -t PENDING

To cancel/stop a job:
```
 scancel <jobid>
```
To cancel all yout jobs:
```
 scancel -u <your login>
```
To cancel all your pending jobs :
```
 scancel -t PENDING -u <your login>
```

HTS_M2-IMaLiS_2020

Practical of the M2 IMaLiS class on transcriptomic data analysis

Practical training – RNA-seq data analysis

1. Introdution

Objective of this practical session:

Dataset used :

2. Connect to the server and set up your working environment

2 - Set up your working environment

3. Sequence Quality Controls

4. Mapping of Reads on the Reference Genome

TO DO: Run sequence alignments with Bowtie using the two FASTQ files. Take a look at Bowtie documentation and describe the specified options (-m in particular). What is the proportion of reads aligned on the reference genome?

5. Alignments Visualization with a Genome Browser

TO DO: Run the conversion of SAM files (obtain with BOWTIE) into BAM files and “sorted and indexed” BAM files. Run IGV.

TO DO: Compare your results with those presented in the original publication: Did the authors use stranded-specific protocols? Can you observe differences between hypoxic and normoxic conditions?

6. Search for Differentially Expressed Genes

Gene counts estimation

TO DO: Calculate for each ORF the number of reads that were aligned (normoxic and hypoxic conditions).

Statistical analysis using DEseq2 R package

TO DO: Search for differentially expressed genes using DESeq R package. How many genes are selected with different p-value thresholds (5%, 1%, etc.)? Check your results with IGV and use GOtermFinder (see practical on microarrays) to analyse the function of the selected genes.

Usefull commands to work on the cluster

HTS_M2-IMaLiS_2020

Practical of the M2 IMaLiS class on transcriptomic data analysis

Practical training – RNA-seq data analysis

1. Introdution

Objective of this practical session:

Dataset used :

2. Connect to the server and set up your working environment

1 - Sign in on the server

2 - Set up your working environment

3. Sequence Quality Controls

TO DO: Use FASTQC to evaluate the quality of sequences in each FASTQ files. Using information from the Fastqc help page as well as exemples of good and bad illumina data as references, compare results between the two FASTQ files. Is there any concern related to the following analyses?

4. Mapping of Reads on the Reference Genome

TO DO: Run sequence alignments with Bowtie using the two FASTQ files. Take a look at Bowtie documentation and describe the specified options (-m in particular). What is the proportion of reads aligned on the reference genome?

5. Alignments Visualization with a Genome Browser

TO DO: Run the conversion of SAM files (obtain with BOWTIE) into BAM files and “sorted and indexed” BAM files. Run IGV.

TO DO: Compare your results with those presented in the original publication: Did the authors use stranded-specific protocols? Can you observe differences between hypoxic and normoxic conditions?

6. Search for Differentially Expressed Genes

Gene counts estimation

TO DO: Calculate for each ORF the number of reads that were aligned (normoxic and hypoxic conditions).

Statistical analysis using DEseq2 R package

TO DO: Search for differentially expressed genes using DESeq R package. How many genes are selected with different p-value thresholds (5%, 1%, etc.)? Check your results with IGV and use GOtermFinder (see practical on microarrays) to analyse the function of the selected genes.

Usefull commands to work on the cluster