University of Copenhagen

Estrogen Data
A 2x2 factorial experiment

Gordon Smyth
6 November 2007

1. Aims

This case study illustrates more advanced linear modeling with Affymetrix single-channel microarrays. The popular 2x2 factorial design is considered. Use of Bioconductor annotation for Affymetrix arrays is illustrated.

2. Required data

The estrogen data set is required for this lab. Set your R working directory to be ku2007limma/data/estrogen. By typing dir() you should see eight CEL files and three text files. On my computer, I see:

> setwd("estrogen")
> dir()
[1] "ERgenes.txt" "high10-1.cel" "high10-2.cel" "high48-1.cel" "high48-2.cel" "low10-1.cel" 
[7] "low10-2.cel" "low48-1.cel" "low48-2.cel" "targets.txt" 

To repeat this case study in full you will need to have the R packages affy, hgu95av2cdf and hgu95av2 installed.

3. Estrogen experiment

The data gives results from a 2x2 factorial experiment on MCF7 breast cancer cells using Affymetrix HGU95av2 arrays. The factors in this experiment were estrogen (present or absent) and length of exposure (10 or 48 hours). The aim of the study is the identify genes which respond to estrogen and to classify these into early and late responders. Genes which respond early are putative direct-target genes while those which respond late are probably downstream targets in the molecular pathway.

This data is from the estrogen data package on the Bioconductor website http://www.bioconductor.org/data/experimental.html. Rather than loading the data package here we simply using the nine basic data files from that package, in order to save storage space and to more closely mimic a real data analysis situation. The data set is discussed further by Scholtens [1,2] and in the Limma User's Guide.

4. Read the data

The first step in most analyses is to read the targets file which describes what RNA target has been hybridized to each array and, equally importantly, gives the names of the corresponding data files.

> library(limma)
> targets <- readTargets()
> targets

Now read the CEL file data into an AffyBatch object and normalize using RMA:

> library(affy)
> library(hgu95av2cdf)
> abatch <- ReadAffy(filenames=targets$filename)
> eset <- rma(abatch)

Here eset is a data object of class exprSet.

It is usual and appropriate to check data quality before continuing your analysis. Due to brevity we will skip over this in this lab. A full set of quality assessment plots can be found at http://www.stat.berkeley.edu/~bolstad/PLMImageGallery/ under the title "estrogen". These plots show no significant quality problems with any arrays in this dataset.

5. Create a design matrix

We have four pairs of replicate arrays so we should estimate four parameters in the linear model. There are many valid ways to choose a design matrix, but perhaps the simplest is to make each column correspond to a particular treatment combination. In this formulation, the four columns of the matrix correspond to absent10, present10, absent48 and present48, respectively. This design matrix given above can be computed in R as follows:

> f <- paste(targets$estrogen,targets$time.h,sep="")
> f <- factor(f)
> f
> design <- model.matrix(~0+f)
> colnames(design) <- levels(f)
> design

6. Fit the linear model

Now that we have defined our design matrix, fitting a linear model is as simple as:

> fit <- lmFit(eset, design)

fit is an object of class MArrayLM.

> names(fit)

The fitted coefficents fit$coef from the model fit are just the mean log-expression for each treatment combination for each probe set. For this reason, this choice of design matrix is called the group means parametrization in the Limma User's Guide.

7. Define a contrast matrix

The idea now is to use contrasts to make any comparisons of interest between the four treatment combination. Contrasts are linear combinations of the coefficients from the linear model fit. We will estimate three contrasts (so our contrasts matrix will have three columns). The first contrast is an estrogen effect at time 10 hours, the second as an estrogen effect at time 48 hours and the third is the time effect in the absence of estrogen. Many other contrasts could be made.

> cont.matrix <- makeContrasts(E10="present10-absent10",E48="present48-absent48",Time="absent48-absent10",levels=design)
> cont.matrix

8. Extract the linear model fit for the contrasts

> fit2  <- contrasts.fit(fit, cont.matrix)
> fit2  <- eBayes(fit2)

9. Assessing differential expression

We now use the function topTable to obtain a list genes differentially expressed between Estrogen-Present and Estrogen-Absent at time 10 hours, followed by a list of genes differentially expressed between Estrogen-Present and Estrogen-Absent at time 48 hours.

> colnames(fit2)
> topTable(fit2,coef=1)
> topTable(fit2,coef=2)

The function decideTests() provides a variety of ways to assign statistical significance to the contrasts while controlling for multiple testing.

> results <- decideTests(fit2)
> summary(results)
> vennDiagram(results)

10. Probe annotation information

Load the annotation package hgu95av2, which can be obtained from http://www.bioconductor.org/data/metaData.html.

> library(hgu95av2)
> library(annotate)
> fit2$genes$Symbol <- getSYMBOL(fit2$genes$ID, "hgu95av2")
> topTable(fit2, coef=2)

11. Save results

At this point you should save your results, ready to continue this case study in the next practical exercise after lunch.

> save(fit2, file="estrogen.rdata")

Acknowledgements

Thanks to Denise Scholtens and Robert Gentleman for providing the estrogen data http://www.bioconductor.org/data/experimental.html.

References

  1. Scholtens D, Gentleman R. Estrogen 2x2 Factorial Design. http://www.bioconductor.org/repository/devel/vignette/factDesign.pdf
  2. Scholtens D, Miron A, Merchant FM, Miller A, Miron PL, Iglehart JD, Gentleman R. Analyzing Factorial Designed Microarray Experiments. Journal of Multivariate Analysis. To appear.
  3. Smyth, G. K. (2005). Limma: linear models for microarray data. In: Bioinformatics and Computational Biology Solutions using R and Bioconductor, R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds.), Springer, New York.
  4. Smyth, G. K., Thorne, N. P. and Wettenhall J. (2005) limma: Linear Models for Microarray Data User's Guide. (See the estrogen case study.)

Return to list of Exercises