University of Copenhagen
Gordon Smyth
6 November 2007
This case study
The integrin beta7 data set is required for this lab. Set integrinbeta7
to be your R working directory:
> setwd("beta7")
Background. This experiment was conducted by the Erle Lab in UC San Francisco (Rodriguez, 2004). This experiment aims to study the cell adhesion molecule integrin alpha4/beta7 which assists in directing the migration of blood lymphocytes to the intestine and associated lymphoid tissues. The goal of the study is to identify differentially expressed genes between the alpha4/beta7+ and alpha4/beta7- memory T helper cells. Differentially expressed genes may play a role in the adhesion or migration of T cells.
Hybridisations. The dataset given here is a subset from the original consisting of arrays from 6 different subjects. Complete information about the array platform and data from each of the individual arrays is available from GEO (accession number GSE1039). Each array compared beta7+ and beta7- cell RNA from the same subject. The arrays were printed with 23,184 probes including the Operon Human version 2 set of 70-mer oligonucleotide probes and 1760 control spots (e.g., negative, positive and normalization controls spots). The microarrays were printed using a 12x4 arrangement of print-tips. Each print-tip grid consists of a 21x23 spot matrix that was printed with a single print-tip.
Image analysis. Arrays were scanned on an Axon GenePix 4000B scanner and the images processed using GenePix 5.0. The data comprises 6 GenePix gpr output files.
A microarray experiment consists of microarray probes which interrogate RNA targets. The first step in analyzing a microarray experiment is to specify the RNA targets for which microarray output is available. In limma, this is done by reading the `targets file'. This specifies the names of the image output files and gives information associated with the RNA targets.
> library(limma) > targets <- readTargets() > targets
Exercise. Read the Limma User's section on targets files.
> RG <- read.maimages(targets, source="genepix") > names(RG) > RG
Notice that the RGList object now contains all the components of a microarray experiments. The 'targets' frame describes the RNA targets, the 'genes' frame describes the probes or genes, and the expression data is contained in the matrices R, G, Rb and Gb.
Exercise: All data objects in limma have object-orientated features which allow them to behave like matrices in many ways, such as subsetting, cbind() and rbind(). Explore the matrix-like properties of RGList objects. Try for example:
> dim(RG) > ncol(RG) > colnames(RG) > RG[,1:2] > RG1 <- RG[,1:2] > RG2 <- RG[,5:6] > cbind(RG1,RG2)
We will start with a quick, but not very good, of this data:
> MA <- normalizeWithinArrays(RG) > design <- modelMatrix(targets, ref="b7minus") > fit <- lmFit(MA, design) > fit <- eBayes(fit) > topTable(fit)
This analysis suggests no significant genes after adjustment for multiple testing. How can we do better? We will do in the following sections.
Exercise: Section 10.2 of the limma User's Guide explains the components of a fitted model object. Identify the components of the linear model and empirical Bayes computation. What are the prior and residual degrees of freedom in this case? Try for example:
> class(fit) > names(fit) > fit$df.prior
> spottypes <- readSpotTypes() > spottypes > RG$genes$Status <- controlStatus(spottypes, RG) > table(RG$genes$Status) > plotMA(RG) > plotMA(RG,array=2)
Exercise. See the Limma User's Guide section on spot-types files. These provide a method for locating control spots or any other spots of interest by matching patterns using the probe IDs or names.
> RGne <- backgroundCorrect(RG, method="normexp", offset=25)
Background correction is more important than often appreciated because it impacts markedly on the variability of the log-ratios for low intensity spots. MA-plots using RGsu will show fanning out of log-ratios at low intensities. When using background subtraction, many spots are not even shown on the MA-plots because the background corrected intensities are negative leading to NA log-ratios. Fanning out of the log-ratios is undesirable for two reasons. Firstly it is undesirable than any log-ratios should be very variable, because this might lead those genes being falsely judged to be differentially expressed. Secondly, the empirical Bayes analysis implemented in eBayes() delivers most benefit when the variability of the log-ratios is as homogeneous as possible across genes.
Exercise. Examine closely the MA-plots from the default and normexp background correction methods. Notice that subtracting produces a decreasing fan effect with intensity while not background correcting produces an increasing fan effect. The 'normexp' produces a more balanced stabilization of the variances. It also preserves all the data.
> plotMA(RGne) > plotMA(RGne, array=2)
The basic design matrix simply records which arrays are dye-swapped:
> design
We will find however that it is much better to include a "dye effect" term in the model. For direct two-color designs, this always takes the form of an intercept term:
> design2 <- cbind(Dye=1, design) > design2
Repeat the normalization:
> MA <- normalizeWithinArrays(RGne)
Prepare to remove the control spots:
> isGene <- RG$genes$Status=="Gene"
Repeat the differential expression:
> fit <- lmFit(MA[isGene,], design2) > fit <- eBayes(fit) > topTable(fit, coef="b7plus") > summary(decideTests(fit))
Exercise: What are the prior degrees of freedom for this fit? How has it changed from the quick fit in Step 6 above?
Thanks to Yee Hwa Yang for making the beta7 data available.