In their article (Guida et al., 2011), the authors repeated the experiment 6 times for normoxic condition (with O2) and 4 times for hypoxic conditions (without O2).
Results obtained for all experiments are combined in the file “/shared/projects/2020_eu_HTSdataAnalysis/RNAseq/R/count_data_diffAnalysis.txt”. This file will be used to search for differentially expressed genes using the DESeq2 R package (Love et al. 2014).
The DESeq2 package provides methods to test for differential expression by use of the negative binonial distribution and a shrinkage estimator for the distribution’s variance.
library(DESeq2)
library(dplyr)
library(reshape2)
library(ggplot2)
#Set ggplot theme as classic
theme_set(theme_classic())
countData <- read.table("/shared/projects/ens_hts_2020/data/rnaseq/count_data_diffAnalysis.txt", row.names = 1, header = TRUE)
Visualisae some general informations : The count matix stores for each samples (in columns) the number reads mapped to each ORF (in rows) counted with bedtools multicov
# Print the first 10 row names (ORF)
row.names(countData)[1:10]
## [1] "CPAR2_404750" "CPAR2_212570" "CPAR2_302130" "CPAR2_209760" "CPAR2_403260"
## [6] "CPAR2_209040" "CPAR2_804040" "CPAR2_211500" "CPAR2_102540" "CPAR2_109950"
You can see that the first 4 samples correspond to the 4 Biological Replicates of the anoxic conditions while there are 6 replicates for the normoxic condition
# Print samples name
colnames(countData)
## [1] "noO2rep1" "noO2rep2" "noO2rep3" "noO2rep4" "O2rep1" "O2rep2"
## [7] "O2rep3" "O2rep4" "O2rep5" "O2rep6"
head(countData)
## noO2rep1 noO2rep2 noO2rep3 noO2rep4 O2rep1 O2rep2 O2rep3 O2rep4
## CPAR2_404750 607 318 461 619 481 487 584 561
## CPAR2_212570 536 257 437 496 633 457 552 648
## CPAR2_302130 24 9 14 14 1 6 10 8
## CPAR2_209760 877 315 550 1207 1400 735 922 1186
## CPAR2_403260 2896 740 3019 2619 1951 1858 2615 2365
## CPAR2_209040 860 258 258 730 289 390 572 452
## O2rep5 O2rep6
## CPAR2_404750 626 642
## CPAR2_212570 984 721
## CPAR2_302130 11 8
## CPAR2_209760 1585 1328
## CPAR2_403260 5065 3185
## CPAR2_209040 329 372
The Design matrix store the information about the biological condition associated to each samples :
N = Normoxic condition (with O2)
H = Hypoxic condition (without O2)
colData <- read.table("/shared/projects/ens_hts_2020/data/rnaseq/design.txt", row.names = 1, header = TRUE)
colData
## condition
## noO2rep1 H
## noO2rep2 H
## noO2rep3 H
## noO2rep4 H
## O2rep1 N
## O2rep2 N
## O2rep3 N
## O2rep4 N
## O2rep5 N
## O2rep6 N
DEseq2 use a specific format to store all the data required to perform the DE analysis. All the results of our analysis will also be stored in specific slots of this object.
dds <- DESeqDataSetFromMatrix(countData = countData,
colData = colData,
design = ~condition)
Using head, you can obtain information about the structure of the object to know where informations are stored
head(dds)
## class: DESeqDataSet
## dim: 6 10
## metadata(1): version
## assays(1): counts
## rownames(6): CPAR2_404750 CPAR2_212570 ... CPAR2_403260 CPAR2_209040
## rowData names(0):
## colnames(10): noO2rep1 noO2rep2 ... O2rep5 O2rep6
## colData names(1): condition
For instance, the count matrix can be access using : counts()
head(counts(dds))
## noO2rep1 noO2rep2 noO2rep3 noO2rep4 O2rep1 O2rep2 O2rep3 O2rep4
## CPAR2_404750 607 318 461 619 481 487 584 561
## CPAR2_212570 536 257 437 496 633 457 552 648
## CPAR2_302130 24 9 14 14 1 6 10 8
## CPAR2_209760 877 315 550 1207 1400 735 922 1186
## CPAR2_403260 2896 740 3019 2619 1951 1858 2615 2365
## CPAR2_209040 860 258 258 730 289 390 572 452
## O2rep5 O2rep6
## CPAR2_404750 626 642
## CPAR2_212570 984 721
## CPAR2_302130 11 8
## CPAR2_209760 1585 1328
## CPAR2_403260 5065 3185
## CPAR2_209040 329 372
Normalization is a necessary step to correct for different sequencing depth between samples. As you can see their is significant differences between total number of reads obtained for each samples.
# bar plots
barplot(colSums(counts(dds)), main = "Total read counts per samples")
Histogramme of the total read count per library.
Calculate the size factor for each sample, which will be use as a correcting factor for sequencing depth (by dividing the raw counts by the size factors)
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
## noO2rep1 noO2rep2 noO2rep3 noO2rep4 O2rep1 O2rep2 O2rep3 O2rep4
## 1.0032672 0.4149502 0.7813992 1.1207499 1.1734090 0.8409439 1.1246014 1.1915117
## O2rep5 O2rep6
## 1.6880443 1.3096478
# Row data
Row.data <- melt(log2(counts(dds)+1))
colnames(Row.data) <- c("ORF", "Samples","Expr.Val")
Row.data <- data.frame(Row.data,
Condition = c(substr(Row.data$Samples[1:(4*5830)], 1, 4),
substr(Row.data$Samples[(4*5830+1):nrow(Row.data)], 1, 2)))
ggplot(Row.data, aes(x = Samples, y = Expr.Val, fill = Condition)) +
geom_boxplot() +
ylab(expression(log[2](count + 1))) +
ggtitle(label = "Per sample distribution of the\nlog2 transformed expression value before normalization")+
theme(axis.text.x=element_text(angle=45, hjust=1))
#Normalized data
Norm.data <- melt(log2(counts(dds, normalized=TRUE)+1))
colnames(Norm.data) <- c("ORF", "Samples","Expr.Val")
Norm.data <- data.frame(Norm.data,
Condition = c(substr(Norm.data$Samples[1:(4*5830)], 1, 4),
substr(Norm.data$Samples[(4*5830+1):nrow(Norm.data)], 1, 2)))
ggplot(Norm.data, aes(x = Samples, y = Expr.Val, fill = Condition)) +
geom_boxplot() +
ylab(expression(log[2](count + 1))) +
ggtitle(label = "Per sample distribution of the\nlog2 transformed expression value after normalization") +
theme(axis.text.x=element_text(angle=45, hjust=1))
Boxplot representing the per sample distribution, before and after normalization.
vsd <- varianceStabilizingTransformation(dds, blind=TRUE)
# PCA plots
plotPCA(vsd, ntop = 500) +
ggtitle(label = "Principal Component Analysis (PCA)", subtitle = "Top 500 most variable genes")
Scater plot of sample along PC1 and PC2 coordinates
dds <- estimateDispersions(dds)
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
plotDispEsts(dds)
Scater plot representing the mean-dispersion relationship for each genes
dds <- nbinomWaldTest(dds)
# Generate summary of testing.
summary(results(dds))
##
## out of 5788 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 1431, 25%
## LFC < 0 (down) : 1412, 24%
## outliers [1] : 4, 0.069%
## low counts [2] : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Retreive the result table
res <- results(dds)
head(res)
## log2 fold change (MLE): condition N vs H
## Wald test p-value: condition N vs H
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## CPAR2_404750 535.386129765818 -0.405616811586334 0.168628335990942
## CPAR2_212570 540.645102587355 0.016486750191156 0.128962428292018
## CPAR2_302130 11.2237591766997 -1.64723901680616 0.556416032371901
## CPAR2_209760 924.940674020237 0.185594008770904 0.166546919368115
## CPAR2_403260 2448.50318859237 -0.260785304359525 0.229520549444927
## CPAR2_209040 453.746336047737 -0.831617529857861 0.315944776053616
## stat pvalue padj
## <numeric> <numeric> <numeric>
## CPAR2_404750 -2.40538939794865 0.0161552413368481 0.0405212124424671
## CPAR2_212570 0.127841499338272 0.898274407772866 0.936668211610331
## CPAR2_302130 -2.96044492065456 0.00307195045993042 0.0103796675953554
## CPAR2_209760 1.11436470560401 0.265122758321658 0.381650083158903
## CPAR2_403260 -1.13621767196972 0.25586547629048 0.371001733483112
## CPAR2_209040 -2.63216103853775 0.00848436368796526 0.0242099455210612
Using mcols() you can obtain a description of the result information stored in column.
Take a look at the log2FoldChange. **What is the directionality of the calculated log2FC ?
mcols(res, use.names=TRUE)
## DataFrame with 6 rows and 2 columns
## type description
## <character> <character>
## baseMean intermediate mean of normalized counts for all samples
## log2FoldChange results log2 fold change (MLE): condition N vs H
## lfcSE results standard error: condition N vs H
## stat results Wald statistic: condition N vs H
## pvalue results Wald test p-value: condition N vs H
## padj results BH adjusted p-values
Convert res into data frame
res <- as.data.frame(res)
ggplot(res, aes(x = pvalue)) +
geom_histogram(bins = 50) +
ggtitle(label = "Pvalue distibution")
Histogramme of the p value distribution
#Set a new data frame which will store all variable needed to generate the volcano plot
Vplot.data <- data.frame(gene = row.names(res),
padj = res$padj,
Log2FC = res$log2FoldChange)
# Remove any rows that have NA
Vplot.data <- na.omit(Vplot.data)
# Mark the genes which are significantly upreagalted or down regulated by an absolute logFC > 2
Vplot.data <- mutate(Vplot.data, color = case_when(Vplot.data$Log2FC > 2 & Vplot.data$padj < 0.05 ~ "LogFc > 2 in normoxic VS hypoxic",
Vplot.data$Log2FC < -2 & Vplot.data$padj < 0.05 ~ "LogFc > 2 in Hypoxic VS normoxic",
Vplot.data$padj > 0.05 | abs(Vplot.data$Log2FC) < 2 ~ "NS"))
# Build the ggplot. Note that we -log10() transform the padj so that gene with the lowest padj will have the higest y axis value
ggplot(Vplot.data, aes(x = Log2FC, y = -log10(padj), color = color)) + # Set the default plot
geom_point(size = 2.5, alpha = 0.8, na.rm = T) + # Set the dot size
scale_color_manual(name = "Expression change",
values = c("LogFc > 2 in normoxic VS hypoxic" = "#008B00",
"LogFc > 2 in Hypoxic VS normoxic" = "#CD4F39",
"NS" = "darkgray")) +
geom_hline(yintercept = 1.3, colour = "red") + # p-adj value cutoff
geom_vline(xintercept = c(-2,2), colour = "red") + # Log2FC value cutoff
xlab(expression(log[2]("Fold change"))) + ylab(expression(-log[10]("adjusted p-value"))) +
ggtitle(label = "Normoxic VS Hypoxic")
Volcano plot with genes colored according to DE filters
Export results for genes with log2FoldChange > 2 and padj < 0.05
FoldFc_sup2 <- res[res$log2FoldChange > 2 & res$padj <0.05,]
nrow(FoldFc_sup2)
## [1] 127
# To save the result in your home working directory
write.table(FoldFc_sup2, row.names = T, quote = F, sep = ";", file = "~/RNAseq_FoldFc_sup2.csv")
Export results for genes with log2FoldChange < -2 and padj < 0.05
FoldFc_inf2 <- res[res$log2FoldChange <(-2) & res$padj <0.05,]
nrow(FoldFc_inf2)
## [1] 151
# To save the result in your home working directory
write.table(FoldFc_inf2, row.names = T, quote = F, sep = ";", file = "~/RNAseq_FoldFc_inf2.csv")
plotMA(results(dds))
MA plot of DE test results
Several tools exist on Internet to evaluate the biological relevance of set of genes.
Here, the GoTermFinder tool will be used: http://www.candidagenome.org/cgi-bin/GO/goTermFinder (dedicated to Candida yeast species).
#date
format(Sys.time(), "%d %B, %Y, %H,%M")
## [1] "05 September, 2020, 15,16"
#Packages used
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-conda_cos6-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS/LAPACK: /shared/mfs/data/software/miniconda/envs/r-3.6.3/lib/libopenblasp-r0.3.9.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ggplot2_3.3.2 reshape2_1.4.4
## [3] dplyr_0.8.5 DESeq2_1.26.0
## [5] SummarizedExperiment_1.16.0 DelayedArray_0.12.0
## [7] BiocParallel_1.20.0 matrixStats_0.56.0
## [9] Biobase_2.46.0 GenomicRanges_1.38.0
## [11] GenomeInfoDb_1.22.0 IRanges_2.20.0
## [13] S4Vectors_0.24.0 BiocGenerics_0.32.0
##
## loaded via a namespace (and not attached):
## [1] bit64_0.9-7 splines_3.6.3 Formula_1.2-3
## [4] assertthat_0.2.1 highr_0.8 latticeExtra_0.6-29
## [7] blob_1.2.1 GenomeInfoDbData_1.2.2 yaml_2.2.1
## [10] pillar_1.4.4 RSQLite_2.2.0 backports_1.1.7
## [13] lattice_0.20-41 glue_1.4.1 digest_0.6.25
## [16] RColorBrewer_1.1-2 XVector_0.26.0 checkmate_2.0.0
## [19] colorspace_1.4-1 plyr_1.8.6 htmltools_0.5.0
## [22] Matrix_1.2-18 XML_3.99-0.3 pkgconfig_2.0.3
## [25] genefilter_1.68.0 zlibbioc_1.32.0 purrr_0.3.4
## [28] xtable_1.8-4 scales_1.1.1 jpeg_0.1-8.1
## [31] htmlTable_1.13.3 tibble_3.0.1 annotate_1.64.0
## [34] farver_2.0.3 ellipsis_0.3.1 withr_2.2.0
## [37] nnet_7.3-14 survival_3.1-12 magrittr_1.5
## [40] crayon_1.3.4 memoise_1.1.0 evaluate_0.14
## [43] foreign_0.8-76 tools_3.6.3 data.table_1.12.8
## [46] lifecycle_0.2.0 stringr_1.4.0 locfit_1.5-9.4
## [49] munsell_0.5.0 cluster_2.1.0 AnnotationDbi_1.48.0
## [52] compiler_3.6.3 rlang_0.4.6 grid_3.6.3
## [55] RCurl_1.98-1.2 rstudioapi_0.11 htmlwidgets_1.5.1
## [58] labeling_0.3 bitops_1.0-6 base64enc_0.1-3
## [61] rmarkdown_2.1 gtable_0.3.0 DBI_1.1.0
## [64] R6_2.4.1 gridExtra_2.3 knitr_1.29
## [67] bit_1.1-15.2 Hmisc_4.4-0 stringi_1.4.6
## [70] Rcpp_1.0.5 geneplotter_1.64.0 vctrs_0.3.0
## [73] rpart_4.1-15 acepack_1.4.1 png_0.1-7
## [76] tidyselect_1.1.0 xfun_0.16