University of Copenhagen
Gordon Smyth
6 November 2007
This case study illustrates more advanced linear modeling with Affymetrix single-channel microarrays. The popular 2x2 factorial design is considered. Use of Bioconductor annotation for Affymetrix arrays is illustrated.
The estrogen data set is required for this lab. Set your R working directory to be ku2007limma/data/estrogen. By typing dir() you should see eight CEL files and three text files. On my computer, I see:
> setwd("estrogen") > dir() [1] "ERgenes.txt" "high10-1.cel" "high10-2.cel" "high48-1.cel" "high48-2.cel" "low10-1.cel" [7] "low10-2.cel" "low48-1.cel" "low48-2.cel" "targets.txt"
To repeat this case study in full you will need to have the R packages affy, hgu95av2cdf and hgu95av2 installed.
The data gives results from a 2x2 factorial experiment on MCF7 breast cancer cells using Affymetrix HGU95av2 arrays. The factors in this experiment were estrogen (present or absent) and length of exposure (10 or 48 hours). The aim of the study is the identify genes which respond to estrogen and to classify these into early and late responders. Genes which respond early are putative direct-target genes while those which respond late are probably downstream targets in the molecular pathway.
This data is from the estrogen data package on the Bioconductor website http://www.bioconductor.org/data/experimental.html. Rather than loading the data package here we simply using the nine basic data files from that package, in order to save storage space and to more closely mimic a real data analysis situation. The data set is discussed further by Scholtens [1,2] and in the Limma User's Guide.
The first step in most analyses is to read the targets file which describes what RNA target has been hybridized to each array and, equally importantly, gives the names of the corresponding data files.
> library(limma) > targets <- readTargets() > targets
Now read the CEL file data into an AffyBatch object and normalize using RMA:
> library(affy) > library(hgu95av2cdf) > abatch <- ReadAffy(filenames=targets$filename) > eset <- rma(abatch)
Here eset is a data object of class exprSet.
It is usual and appropriate to check data quality before continuing your analysis. Due to brevity we will skip over this in this lab. A full set of quality assessment plots can be found at http://www.stat.berkeley.edu/~bolstad/PLMImageGallery/ under the title "estrogen". These plots show no significant quality problems with any arrays in this dataset.
We have four pairs of replicate arrays so we should estimate four parameters
in the linear model. There are many valid ways to choose a design matrix, but
perhaps the simplest is to make each column correspond to a particular treatment
combination. In this formulation, the four columns of the matrix correspond to absent10
,
present10
, absent48
and present48
,
respectively. This design matrix given above can be computed in R as follows:
> f <- paste(targets$estrogen,targets$time.h,sep="") > f <- factor(f) > f > design <- model.matrix(~0+f) > colnames(design) <- levels(f) > design
Now that we have defined our design matrix, fitting a linear model is as simple as:
> fit <- lmFit(eset, design)
fit
is an object of class MArrayLM
.
> names(fit)
The fitted coefficents fit$coef from the model fit are just the mean
log-expression for each treatment combination for each probe set. For this
reason, this choice of design matrix is called the group means
parametrization in the Limma User's Guide.
The idea now is to use contrasts to make any comparisons of interest between
the four treatment combination. Contrasts are linear combinations of the
coefficients from the linear model fit. We will estimate three contrasts (so our contrasts matrix will have three
columns). The first contrast is an estrogen effect at time 10 hours,
the second as an estrogen effect at time 48 hours and the third is the time effect in
the absence of estrogen. Many other contrasts could be made.
> cont.matrix <- makeContrasts(E10="present10-absent10",E48="present48-absent48",Time="absent48-absent10",levels=design) > cont.matrix
> fit2 <- contrasts.fit(fit, cont.matrix) > fit2 <- eBayes(fit2)
We now use the function topTable
to obtain a list genes differentially
expressed between Estrogen-Present and Estrogen-Absent at time 10 hours, followed by a list of
genes differentially expressed between Estrogen-Present and Estrogen-Absent at time 48 hours.
> colnames(fit2) > topTable(fit2,coef=1) > topTable(fit2,coef=2)
The function decideTests() provides a variety of ways to assign statistical significance to the contrasts while controlling for multiple testing.
> results <- decideTests(fit2) > summary(results) > vennDiagram(results)
Load the annotation package hgu95av2
, which can be obtained from
http://www.bioconductor.org/data/metaData.html.
> library(hgu95av2) > library(annotate) > fit2$genes$Symbol <- getSYMBOL(fit2$genes$ID, "hgu95av2") > topTable(fit2, coef=2)
At this point you should save your results, ready to continue this case study in the next practical exercise after lunch.
> save(fit2, file="estrogen.rdata")
Thanks to Denise Scholtens and Robert Gentleman for providing the estrogen data http://www.bioconductor.org/data/experimental.html.