Introduction

In their article (Guida et al., 2011), the authors repeated the experiment 6 times for normoxic condition (with O2) and 4 times for hypoxic conditions (without O2).

Results obtained for all experiments are combined in the file “/shared/projects/2020_eu_HTSdataAnalysis/RNAseq/R/count_data_diffAnalysis.txt”. This file will be used to search for differentially expressed genes using the DESeq2 R package (Love et al. 2014).

The DESeq2 package provides methods to test for differential expression by use of the negative binonial distribution and a shrinkage estimator for the distribution’s variance.

TO DO : Search for differentially expressed genes using DESeq R package.

How many genes are selected with different p-value thresholds (5%, 1%, etc.) ?
Check your results with IGV and use GOtermFinder to analyse the function of the selected genes.

libraries loading

library(DESeq2)

library(dplyr)
library(reshape2)
library(ggplot2)

#Set ggplot theme as classic
theme_set(theme_classic())

Prepare the data for the analysis

Load the raw count matrix

countData <- read.table("/shared/projects/ens_hts_2020/data/rnaseq/count_data_diffAnalysis.txt", row.names = 1, header = TRUE)

Visualisae some general informations : The count matix stores for each samples (in columns) the number reads mapped to each ORF (in rows) counted with bedtools multicov

# Print the first 10 row names (ORF)
row.names(countData)[1:10]

##  [1] "CPAR2_404750" "CPAR2_212570" "CPAR2_302130" "CPAR2_209760" "CPAR2_403260"
##  [6] "CPAR2_209040" "CPAR2_804040" "CPAR2_211500" "CPAR2_102540" "CPAR2_109950"

You can see that the first 4 samples correspond to the 4 Biological Replicates of the anoxic conditions while there are 6 replicates for the normoxic condition

# Print samples name
colnames(countData)

##  [1] "noO2rep1" "noO2rep2" "noO2rep3" "noO2rep4" "O2rep1"   "O2rep2"  
##  [7] "O2rep3"   "O2rep4"   "O2rep5"   "O2rep6"

head(countData)

##              noO2rep1 noO2rep2 noO2rep3 noO2rep4 O2rep1 O2rep2 O2rep3 O2rep4
## CPAR2_404750      607      318      461      619    481    487    584    561
## CPAR2_212570      536      257      437      496    633    457    552    648
## CPAR2_302130       24        9       14       14      1      6     10      8
## CPAR2_209760      877      315      550     1207   1400    735    922   1186
## CPAR2_403260     2896      740     3019     2619   1951   1858   2615   2365
## CPAR2_209040      860      258      258      730    289    390    572    452
##              O2rep5 O2rep6
## CPAR2_404750    626    642
## CPAR2_212570    984    721
## CPAR2_302130     11      8
## CPAR2_209760   1585   1328
## CPAR2_403260   5065   3185
## CPAR2_209040    329    372

Load metadata of the experiment (design matrix)

The Design matrix store the information about the biological condition associated to each samples :
N = Normoxic condition (with O2)
H = Hypoxic condition (without O2)

colData <- read.table("/shared/projects/ens_hts_2020/data/rnaseq/design.txt", row.names = 1, header = TRUE)

colData

##          condition
## noO2rep1         H
## noO2rep2         H
## noO2rep3         H
## noO2rep4         H
## O2rep1           N
## O2rep2           N
## O2rep3           N
## O2rep4           N
## O2rep5           N
## O2rep6           N

Initiate the DEseq2’s specific object DESeqDataSet

DEseq2 use a specific format to store all the data required to perform the DE analysis. All the results of our analysis will also be stored in specific slots of this object.

dds <- DESeqDataSetFromMatrix(countData = countData,
                             colData = colData,
                             design = ~condition)

Using head, you can obtain information about the structure of the object to know where informations are stored

head(dds)

## class: DESeqDataSet 
## dim: 6 10 
## metadata(1): version
## assays(1): counts
## rownames(6): CPAR2_404750 CPAR2_212570 ... CPAR2_403260 CPAR2_209040
## rowData names(0):
## colnames(10): noO2rep1 noO2rep2 ... O2rep5 O2rep6
## colData names(1): condition

For instance, the count matrix can be access using : counts()

head(counts(dds))

##              noO2rep1 noO2rep2 noO2rep3 noO2rep4 O2rep1 O2rep2 O2rep3 O2rep4
## CPAR2_404750      607      318      461      619    481    487    584    561
## CPAR2_212570      536      257      437      496    633    457    552    648
## CPAR2_302130       24        9       14       14      1      6     10      8
## CPAR2_209760      877      315      550     1207   1400    735    922   1186
## CPAR2_403260     2896      740     3019     2619   1951   1858   2615   2365
## CPAR2_209040      860      258      258      730    289    390    572    452
##              O2rep5 O2rep6
## CPAR2_404750    626    642
## CPAR2_212570    984    721
## CPAR2_302130     11      8
## CPAR2_209760   1585   1328
## CPAR2_403260   5065   3185
## CPAR2_209040    329    372

Preprocessing

Normalization

Normalization is a necessary step to correct for different sequencing depth between samples. As you can see their is significant differences between total number of reads obtained for each samples.

# bar plots
barplot(colSums(counts(dds)), main = "Total read counts per samples")

Histogramme of the total read count per library.

Calculate the size factor

Calculate the size factor for each sample, which will be use as a correcting factor for sequencing depth (by dividing the raw counts by the size factors)

dds <- estimateSizeFactors(dds)
sizeFactors(dds)

##  noO2rep1  noO2rep2  noO2rep3  noO2rep4    O2rep1    O2rep2    O2rep3    O2rep4 
## 1.0032672 0.4149502 0.7813992 1.1207499 1.1734090 0.8409439 1.1246014 1.1915117 
##    O2rep5    O2rep6 
## 1.6880443 1.3096478

# Row data
Row.data <- melt(log2(counts(dds)+1))
colnames(Row.data) <- c("ORF", "Samples","Expr.Val")
Row.data <- data.frame(Row.data,
                 Condition = c(substr(Row.data$Samples[1:(4*5830)], 1, 4),
                               substr(Row.data$Samples[(4*5830+1):nrow(Row.data)], 1, 2)))


ggplot(Row.data, aes(x = Samples, y = Expr.Val, fill = Condition)) +
  geom_boxplot() +
  ylab(expression(log[2](count + 1))) + 
  ggtitle(label = "Per sample distribution of the\nlog2 transformed expression value before normalization")+
  theme(axis.text.x=element_text(angle=45, hjust=1))

#Normalized data
Norm.data <- melt(log2(counts(dds, normalized=TRUE)+1))
colnames(Norm.data) <- c("ORF", "Samples","Expr.Val")
Norm.data <- data.frame(Norm.data,
                 Condition = c(substr(Norm.data$Samples[1:(4*5830)], 1, 4),
                               substr(Norm.data$Samples[(4*5830+1):nrow(Norm.data)], 1, 2)))


ggplot(Norm.data, aes(x = Samples, y = Expr.Val, fill = Condition)) +
  geom_boxplot() +
  ylab(expression(log[2](count + 1))) + 
  ggtitle(label = "Per sample distribution of the\nlog2 transformed expression value after normalization") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

Boxplot representing the per sample distribution, before and after normalization.

PCA of samples

vsd <- varianceStabilizingTransformation(dds, blind=TRUE)

# PCA plots
plotPCA(vsd, ntop = 500) + 
  ggtitle(label = "Principal Component Analysis (PCA)", subtitle = "Top 500 most variable  genes")

Scater plot of sample along PC1 and PC2 coordinates

Differential expression analysis

Variance/Dispersion estimation

dds <- estimateDispersions(dds)

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

plotDispEsts(dds)

Scater plot representing the mean-dispersion relationship for each genes

Perform the negative binomial wald test implemented by DEseq2

dds <- nbinomWaldTest(dds)

# Generate summary of testing. 
summary(results(dds))

## 
## out of 5788 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 1431, 25%
## LFC < 0 (down)     : 1412, 24%
## outliers [1]       : 4, 0.069%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Retreive the result table

res <- results(dds)
head(res)

## log2 fold change (MLE): condition N vs H 
## Wald test p-value: condition N vs H 
## DataFrame with 6 rows and 6 columns
##                      baseMean     log2FoldChange             lfcSE
##                     <numeric>          <numeric>         <numeric>
## CPAR2_404750 535.386129765818 -0.405616811586334 0.168628335990942
## CPAR2_212570 540.645102587355  0.016486750191156 0.128962428292018
## CPAR2_302130 11.2237591766997  -1.64723901680616 0.556416032371901
## CPAR2_209760 924.940674020237  0.185594008770904 0.166546919368115
## CPAR2_403260 2448.50318859237 -0.260785304359525 0.229520549444927
## CPAR2_209040 453.746336047737 -0.831617529857861 0.315944776053616
##                           stat              pvalue               padj
##                      <numeric>           <numeric>          <numeric>
## CPAR2_404750 -2.40538939794865  0.0161552413368481 0.0405212124424671
## CPAR2_212570 0.127841499338272   0.898274407772866  0.936668211610331
## CPAR2_302130 -2.96044492065456 0.00307195045993042 0.0103796675953554
## CPAR2_209760  1.11436470560401   0.265122758321658  0.381650083158903
## CPAR2_403260 -1.13621767196972    0.25586547629048  0.371001733483112
## CPAR2_209040 -2.63216103853775 0.00848436368796526 0.0242099455210612

Using mcols() you can obtain a description of the result information stored in column.
Take a look at the log2FoldChange. **What is the directionality of the calculated log2FC ?

mcols(res, use.names=TRUE)

## DataFrame with 6 rows and 2 columns
##                        type                               description
##                 <character>                               <character>
## baseMean       intermediate mean of normalized counts for all samples
## log2FoldChange      results  log2 fold change (MLE): condition N vs H
## lfcSE               results          standard error: condition N vs H
## stat                results          Wald statistic: condition N vs H
## pvalue              results       Wald test p-value: condition N vs H
## padj                results                      BH adjusted p-values

Convert res into data frame

res <- as.data.frame(res)

DE test results analysis

Diagnostic plots

Inspect the histogram of the p values

ggplot(res, aes(x = pvalue)) +
  geom_histogram(bins = 50) + 
  ggtitle(label = "Pvalue distibution")

Histogramme of the p value distribution

Analysis of DE genes

Volcano plot

#Set a new data frame which will store all variable needed to generate the volcano plot
Vplot.data <- data.frame(gene = row.names(res),
                         padj = res$padj, 
                         Log2FC = res$log2FoldChange)

# Remove any rows that have NA 
Vplot.data <- na.omit(Vplot.data)

# Mark the genes which are significantly upreagalted or down regulated by an absolute logFC > 2

Vplot.data <- mutate(Vplot.data, color = case_when(Vplot.data$Log2FC > 2 & Vplot.data$padj < 0.05 ~ "LogFc > 2 in normoxic VS hypoxic",
                                                   Vplot.data$Log2FC < -2 & Vplot.data$padj < 0.05 ~ "LogFc > 2 in Hypoxic VS normoxic",
                                                   Vplot.data$padj > 0.05 | abs(Vplot.data$Log2FC) < 2 ~ "NS"))

# Build the ggplot. Note that we -log10() transform the padj so that gene with the lowest padj will have the higest y axis value
ggplot(Vplot.data, aes(x = Log2FC, y = -log10(padj), color = color)) + # Set the default plot
  geom_point(size = 2.5, alpha = 0.8, na.rm = T) + # Set the dot size
  scale_color_manual(name = "Expression change",
                     values = c("LogFc > 2 in normoxic VS hypoxic" = "#008B00",
                                "LogFc > 2 in Hypoxic VS normoxic" = "#CD4F39",
                                "NS" = "darkgray")) +
  geom_hline(yintercept = 1.3, colour = "red") + #  p-adj value cutoff
  geom_vline(xintercept = c(-2,2), colour = "red") + # Log2FC value cutoff
  xlab(expression(log[2]("Fold change"))) + ylab(expression(-log[10]("adjusted p-value"))) +
  ggtitle(label = "Normoxic VS Hypoxic")

Volcano plot with genes colored according to DE filters

Export results for genes with log2FoldChange > 2 and padj < 0.05

FoldFc_sup2 <- res[res$log2FoldChange > 2 & res$padj <0.05,]
nrow(FoldFc_sup2)

## [1] 127

# To save the result in your home working directory
write.table(FoldFc_sup2, row.names = T, quote = F, sep = ";", file = "~/RNAseq_FoldFc_sup2.csv")

Export results for genes with log2FoldChange < -2 and padj < 0.05

FoldFc_inf2 <- res[res$log2FoldChange <(-2) & res$padj <0.05,]
nrow(FoldFc_inf2)

## [1] 151

# To save the result in your home working directory
write.table(FoldFc_inf2, row.names = T, quote = F, sep = ";", file = "~/RNAseq_FoldFc_inf2.csv")

MA plot

plotMA(results(dds))

MA plot of DE test results

Functional annotation of DE genes

Several tools exist on Internet to evaluate the biological relevance of set of genes.
Here, the GoTermFinder tool will be used: http://www.candidagenome.org/cgi-bin/GO/goTermFinder (dedicated to Candida yeast species).

To do : What are the functions of the genes in your lists (identified at the previous step).

Are they relevant with the studied biological system (see Guida et al. for detailed information)?

Reproductibility

#date
format(Sys.time(), "%d %B, %Y, %H,%M")

## [1] "05 September, 2020, 15,16"

#Packages used
sessionInfo()

## R version 3.6.3 (2020-02-29)
## Platform: x86_64-conda_cos6-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /shared/mfs/data/software/miniconda/envs/r-3.6.3/lib/libopenblasp-r0.3.9.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ggplot2_3.3.2               reshape2_1.4.4             
##  [3] dplyr_0.8.5                 DESeq2_1.26.0              
##  [5] SummarizedExperiment_1.16.0 DelayedArray_0.12.0        
##  [7] BiocParallel_1.20.0         matrixStats_0.56.0         
##  [9] Biobase_2.46.0              GenomicRanges_1.38.0       
## [11] GenomeInfoDb_1.22.0         IRanges_2.20.0             
## [13] S4Vectors_0.24.0            BiocGenerics_0.32.0        
## 
## loaded via a namespace (and not attached):
##  [1] bit64_0.9-7            splines_3.6.3          Formula_1.2-3         
##  [4] assertthat_0.2.1       highr_0.8              latticeExtra_0.6-29   
##  [7] blob_1.2.1             GenomeInfoDbData_1.2.2 yaml_2.2.1            
## [10] pillar_1.4.4           RSQLite_2.2.0          backports_1.1.7       
## [13] lattice_0.20-41        glue_1.4.1             digest_0.6.25         
## [16] RColorBrewer_1.1-2     XVector_0.26.0         checkmate_2.0.0       
## [19] colorspace_1.4-1       plyr_1.8.6             htmltools_0.5.0       
## [22] Matrix_1.2-18          XML_3.99-0.3           pkgconfig_2.0.3       
## [25] genefilter_1.68.0      zlibbioc_1.32.0        purrr_0.3.4           
## [28] xtable_1.8-4           scales_1.1.1           jpeg_0.1-8.1          
## [31] htmlTable_1.13.3       tibble_3.0.1           annotate_1.64.0       
## [34] farver_2.0.3           ellipsis_0.3.1         withr_2.2.0           
## [37] nnet_7.3-14            survival_3.1-12        magrittr_1.5          
## [40] crayon_1.3.4           memoise_1.1.0          evaluate_0.14         
## [43] foreign_0.8-76         tools_3.6.3            data.table_1.12.8     
## [46] lifecycle_0.2.0        stringr_1.4.0          locfit_1.5-9.4        
## [49] munsell_0.5.0          cluster_2.1.0          AnnotationDbi_1.48.0  
## [52] compiler_3.6.3         rlang_0.4.6            grid_3.6.3            
## [55] RCurl_1.98-1.2         rstudioapi_0.11        htmlwidgets_1.5.1     
## [58] labeling_0.3           bitops_1.0-6           base64enc_0.1-3       
## [61] rmarkdown_2.1          gtable_0.3.0           DBI_1.1.0             
## [64] R6_2.4.1               gridExtra_2.3          knitr_1.29            
## [67] bit_1.1-15.2           Hmisc_4.4-0            stringi_1.4.6         
## [70] Rcpp_1.0.5             geneplotter_1.64.0     vctrs_0.3.0           
## [73] rpart_4.1-15           acepack_1.4.1          png_0.1-7             
## [76] tidyselect_1.1.0       xfun_0.16

DE analysis using DEseq2

Stephane Le Crom, stephane.le_crom@sorbonne-universite.fr

Matthieu Moreau, matthieu.moreau@inserm.fr

05 September, 2020