Aim: Getting started with statistical approaches and bioinformatics tools commonly used to analyze microarray experiments and to cluster genes according to their expression profiles.

This practical is divided in three sections:

Correction of experimental biases (part I)
Detection of differentially expressed genes (part II)
Clustering of co-expressed genes (part III)

The first microarray datasets were collected from the publication of Guida et al.2011.
The authors used high throughput technologies (microarrays and deep sequencing) to determine the transcriptional profile of the pathogenic yeast Candida parapsilosis growing in several conditions including media, temperature and oxygen concentrations.
We will use the datasets related to the study of the hypoxic (low oxygen) response in C. parapsilosis.

library loading

library(marray)
library(limma)

1 Preprocessing of the raw data : Correction of experimental biases

The hypoxia experiments were performed comparing one cell culture incubated at atmospheric oxygen conditions (Cy5 signal) and another one incubated in 1% O~2 (Cy3 signal).

Reading of GPR file

Input : a GPR file with detailed information for each spot on the slide (gene name, Cy5 and Cy3 intensity values, background intensities and other statistics).

The library Marray offers several functions to : * Read GPR files * Draw graphical representations of microarray results (foreground and background signals, missing values, MAplots, etc.) * Perform the normalization between Cy5 and Cy3 signals.

#Read the GPR file usinf the marray package function read.GenePix
rawdata <- read.GenePix(fnames="dataFile1_normAnalysis.gpr",
                        path= "/shared/projects/ens_hts_2020/data/microarrays/data/")

## Reading ...  /shared/projects/ens_hts_2020/data/microarrays/data//dataFile1_normAnalysis.gpr

Note : This function reads a GPR file and creates objects of class “marrayRaw”. In these objects, you can find, for instance, vectors with intensity values (“rawdata@maRf” or “rawdata@maGf”). These vectors can be manipulated using classical R functions like “summary()”, “hist()”, etc.

To do: Take a few minutes to better understand the structure of the R object “marrayRaw”. Start for instance to manipulate the vectors with foreground signals (“rawdata@maRf” or “rawdata@maGf”).

# intensity values in red channel f = foreground
head(rawdata@maRf)

##      /shared/projects/ens_hts_2020/data/microarrays/data//dataFile1_normAnalysis.gpr
## [1,]                                                                             992
## [2,]                                                                            1907
## [3,]                                                                             559
## [4,]                                                                             645
## [5,]                                                                           32939
## [6,]                                                                             681

# intensity values in green channel
head(rawdata@maGf)

##      /shared/projects/ens_hts_2020/data/microarrays/data//dataFile1_normAnalysis.gpr
## [1,]                                                                            2561
## [2,]                                                                            2585
## [3,]                                                                            1588
## [4,]                                                                            1604
## [5,]                                                                           43755
## [6,]                                                                            1732

Visualization of foreground signal

# Red signals
image(rawdata,
      xvar = "maRf",
      main = "Red signal (with flags)")

## [1] FALSE

## NULL

# Green signals
image(rawdata,
      xvar = "maGf",
      main = "Red signal (with flags)")

## [1] FALSE

## NULL

Evaluation of data quality

Visualization of background signal

To do: Visualize background signals in Red and Green channels.

Try to interpret the obtained results. How is the quality of the experiment?

# Red channel background signals
image(rawdata,
      xvar = "maRb",
      main = "Red background (with flags)")

## [1] FALSE

## NULL

# Green channel background signals
image(rawdata,
      xvar = "maGb",
      main = "Green background (with flags)")

## [1] FALSE

## NULL

“Flag” locations on the slide

Each spot is automaticaly associated with a flag value reporting some quality information

Flag Values :
* -50 not found * -75 empty * -100 bad * 0 good

To do: Manipulate flags annotation.

What is the number of spots for each type of Flags?
What does that mean?
Visualize the location of the Flags on the slide. Is there any problem?

Manipulate flagged spots

Create a toy copy of the rawdata for example purpose

copyRawData <- rawdata
table(rawdata@maW)

## 
## -100  -75  -50    0 
##  103  192 5922 9335

Remove background intensity value for flagged spots

copyRawData@maRb[rawdata@maW < 0] = NA
copyRawData@maGb[rawdata@maW < 0] = NA

Visualization of background signals without flags

image(rawdata,
      xvar = "maRb",
      main = "Red background (with flags)",
      colorinfo =F)

## [1] FALSE

## NULL

image(copyRawData,
      xvar = "maRb",
      main = "Red background (without flag)",
      colorinfo =F)

## [1] FALSE

## NULL

Comparison of the red background signal with and without flagged spots

image(rawdata,
      xvar = "maGb",
      main = "Green background (with flags)",
      colorinfo =F)

## [1] FALSE

## NULL

image(copyRawData,
      xvar = "maGb",
      main = "Green background (without flag)",
      colorinfo =F)

## [1] FALSE

## NULL

Comparison of the green background signal with and without flagged spots

# Remove the toy object
rm(copyRawData)

In the next section, experimental biases will be corrected and it is important to exclude all the spots for which the Flag values are negative. For that, intensity values in foreground and background signals have to be carefully replaced with the R symbol “NA” (missing values, “Not Available”).

Filter flagged spots

Flag location on the slide

MyColor <- maPalette(low = "blue", high = "white" , k = 10)

image(rawdata,
      xvar = "maW",
      col = MyColor,
      zlim = c(min(rawdata@maW), max(rawdata@maW)),
      main = "Location of Flags on the slide")

## [1] FALSE

## NULL

Negatively flagged spots will be eliminated from further analyses by replacing their intensity values are replaced by NA (missing values)

rawdataWithoutFlags <- rawdata

#Background signal to NA
rawdataWithoutFlags@maRb[rawdataWithoutFlags@maW < 0] = NA
rawdataWithoutFlags@maGb[rawdataWithoutFlags@maW < 0] = NA 

#Signal value to NA
rawdataWithoutFlags@maRf[rawdataWithoutFlags@maW < 0] = NA 
rawdataWithoutFlags@maGf[rawdataWithoutFlags@maW < 0] = NA

Background correction

An intuitive approach for background correction consists in subtracting background intensity values (“rawdata@maRb” and “rawdata@maGb”) from global foreground intensities (“rawdata@maRf” and “rawdata@maGf”). Nevertheless this method can be debatable mainly because it can create overestimated log(Ratio) values in case of low intensities. For this reason the following analyses will be performed with no background subtraction.

#Replace all background by 0
rawdataWithoutFlags@maGb[] = 0
rawdataWithoutFlags@maRb[] = 0

Normalization

Comparison of Cy5 and Cy3 global signals

To do: Draw the MA plot between Cy5 and Cy3 signals.

How is the experimental bias?
What supplementary information gives the boxplot representation?
What kind of normalization method needs to be applied (global or local)?

plot(rawdataWithoutFlags,legend.func = NULL, main = "MA plot before normalization")
plot(rawdataWithoutFlags, main = "MA plot before normalization")

boxplot(rawdataWithoutFlags, main = "Boxplot before normalization")

Compare the three different normalization procedures for signal normalization

To do: Normalize the intensity measures between Cy5 and Cy3 signals.

Try different methods (« median », « loess » and « printTipLoess »).
Draw the log2(R/G) distribution before and after normalization.
How do you interpret the results?
Draw the associated MA plot and BoxPlot (after normalization). What are the differences with the graphs obtained before normalization?

rawdataWithoutFlagsNorm <- maNorm(rawdataWithoutFlags, norm = "median", echo = T)

## Normalization method: median.
## Normalizing array 1.

rawdataWithoutFlagsNorm2 <- maNorm(rawdataWithoutFlags, norm = "loess", echo = T)

## Normalization method: loess.
## Normalizing array 1.

rawdataWithoutFlagsNorm3 <- maNorm(rawdataWithoutFlags, norm = "printTipLoess", echo = T)

## Normalization method: printTipLoess.
## Normalizing array 1.

Several plots allow for comparison of the normalization methods

plot(rawdataWithoutFlagsNorm, legend.func = NULL, main = "norm = Median")

plot(rawdataWithoutFlagsNorm2, legend.func = NULL, main = "norm = Loess")

plot(rawdataWithoutFlagsNorm3, legend.func = NULL, main = "norm = printTipLoess")

boxplot(rawdataWithoutFlagsNorm, main = "norm = Median")

boxplot(rawdataWithoutFlagsNorm2, main = "norm = Loess")

boxplot(rawdataWithoutFlagsNorm3, main = "norm = printTipLoess")

plot(density(maM(rawdataWithoutFlagsNorm2),na.rm = T),
     lwd = 2, col = 2, main = "Distribution of log(Ratio)")
lines(density(maM(rawdataWithoutFlags), na.rm = T), lwd = 2)
abline(v = 0)
legend(x= 1, y= 1,c("Before normalization","After normalization with loess"), fill = c(1,2))

2 Search for differentially expressed genes

In their article (Guida et al., 2011), the authors repeated the experiment 4 times for normoxic condition (with O~2 ) and 4 times for hypoxic conditions (without O~2 ). Expressions of genes between the two conditions were compared using microarrays (Ratio = hypoxia / normoxia).

To do: Identify the differentially expressed genes using LIMMA method.

Try different p-value thresholds: 5%, 1%, etc..
How many genes are induced in hypoxic condition (without 02 > with 02)?.
How many genes are repressed in hypoxic condition (without O2 < with 02)?

Load the matrix containing the normalized log ratio intensity value for each replicates

Input : a text file with four different biological replicates (after normalization).

dataFile <- "/shared/projects/ens_hts_2020/data/microarrays/data/dataFile_diffAnalysis.txt"
data <- as.matrix(read.table(dataFile, row.names = 1, header = T))

dim(data)

## [1] 5526    4

data[1:10,1:4]

##                   logVal1      logVal2      logVal3     logVal4
## CPAR2_201050 -0.265265616 -0.130465012  0.008997103 -0.06624613
## CPAR2_101960 -0.843512598 -0.608422137 -0.103000282 -0.45358870
## CPAR2_101290  0.056414092  0.000296908 -0.068354697  0.05983511
## CPAR2_405520  0.464588136  0.509999239  0.284349940  0.44530769
## CPAR2_201590 -0.230207648 -0.176294382 -0.265324830 -0.24833664
## CPAR2_103750 -0.194992750 -0.186335163  0.191242260 -0.57185971
## CPAR2_100170 -0.132982234 -0.191465175 -0.126354218  0.00331530
## CPAR2_202790  0.973402061  0.853915233  0.808972712  0.74969076
## CPAR2_301860 -0.008917937  0.018171339 -0.021780941  0.16899955
## CPAR2_106430 -1.598703129 -1.508676852 -0.642865880 -0.87494246

Linear model estimations and calculation of the Bayesian statistics

We will performe the DE analysis using the limma package

# Linear model estimation
fit <- lmFit(data)

# Bayesian statistics
limma.res <- eBayes(fit)

head(topTable(limma.res))

##                 logFC  AveExpr        t      P.Value    adj.P.Val        B
## CPAR2_404850 6.462651 6.462651 71.45643 3.788865e-10 2.093727e-06 13.59386
## CPAR2_503990 5.168192 5.168192 61.93992 9.047930e-10 2.499943e-06 13.02649
## CPAR2_502580 3.504953 3.504953 50.62005 3.091002e-09 4.333583e-06 12.10863
## CPAR2_807620 3.614666 3.614666 50.49767 3.136868e-09 4.333583e-06 12.09684
## CPAR2_401230 3.328905 3.328905 45.27100 6.097772e-09 5.982882e-06 11.54690
## CPAG_00607   3.396506 3.396506 43.79661 7.457963e-09 5.982882e-06 11.37366

Selection of differentially expressed genes (p-vlaue threshold = 0.01 here)

allgenes.limma <- topTable(limma.res, number = nrow(data)) # Retreive result table for all genes
siggenes.limma <- allgenes.limma[allgenes.limma[,5] < 0.01,] # Filter on the adj.P.Val

paste(dim(siggenes.limma[siggenes.limma[,2] > 0,])[1], "upregulated genes (logFC value > 0)")

## [1] "942 upregulated genes (logFC value > 0)"

paste(dim(siggenes.limma[siggenes.limma[,2] < 0,])[1], "downpregulated genes (logFC value < 0)")

## [1] "725 downpregulated genes (logFC value < 0)"

# To export DE gene table into your home directory:
write.table(siggenes.limma[siggenes.limma[,2] > 0,], 
            row.names = T, quote = F, sep = ";",
            file = "~/limma_up_signif_genes.csv")

write.table(siggenes.limma[siggenes.limma[,2] < 0,], 
            row.names = T, quote = F, sep = ";",
            file = "~/limma_low_signif_genes.csv")

Plot results of the DE analysis

Volcano plot

attach(allgenes.limma)

volcanoplot(limma.res, main = "Hypoxic  VS normoxic ",pch =21)
abline(v = c(-2,2), col = "red")
abline(h = 1.3, col= "red", lty =2)
points(siggenes.limma$logFC[logFC >2 & adj.P.Val < 0.05], -1 * log10(siggenes.limma$P.Value[logFC >2  & adj.P.Val < 0.05]), col ="red")
points(siggenes.limma$logFC[logFC <(-2) & adj.P.Val < 0.05], -1 * log10(siggenes.limma$P.Value[logFC <(-2)  & adj.P.Val < 0.05]), col ="green")
legend("topleft", c("11 genes with LogFC > 2 in hypoxic VS normoxic", "78 genes with LogFc > 2 in normoxic VS hypoxic"),pch = 21, col = c("green", "red"), bty ="n", cex =.9)

3 Functional analyzes of differentially expressed genes

Several tools exist on Internet to evaluate the biological relevance of set of genes.
Here, the GoTermFinder tool will be used: http://www.candidagenome.org/cgi-bin/GO/goTermFinder (dedicated to Candida yeast species).

To do : What are the functions of the genes in your lists (identified at the previous step).

Are they relevant with the studied biological system (see Guida et al. for detailed information)?

Reproductibility

#date
format(Sys.time(), "%d %B, %Y, %H,%M")

## [1] "05 September, 2020, 15,03"

#Packages used
sessionInfo()

## R version 3.6.3 (2020-02-29)
## Platform: x86_64-conda_cos6-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /shared/mfs/data/software/miniconda/envs/r-3.6.3/lib/libopenblasp-r0.3.9.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] marray_1.64.0 limma_3.42.2 
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.3  magrittr_1.5    tools_3.6.3     htmltools_0.5.0
##  [5] yaml_2.2.1      stringi_1.4.6   rmarkdown_2.1   highr_0.8      
##  [9] knitr_1.29      stringr_1.4.0   xfun_0.16       digest_0.6.25  
## [13] rlang_0.4.6     evaluate_0.14

Microarray

Stephane Le Crom, stephane.le_crom@sorbonne-universite.fr

Matthieu Moreau, matthieu.moreau@inserm.fr

05 September, 2020