Chapter 4 Microarrays

4.1 Overview

Our goal is to find a measurement that distinguishes T cells from other cells, and T cell subsets from one another. Gene expression, as measured by expression microarrays, is one candidate. In this chapter, we’ll seek out open expression microarray data generated from T cells and analyze it to see whether it can distinguish T cells and their subsets.

We will create several tibbles over the course of this chapter, including:

  1. pGSE2770tidy: phenotype information for a single GSE GEO ID from ImmuneSigDB. Each row is a GSM GEO ID, i.e. a sample.

  2. eset_tib_genes: preprocessed expression data for the samples we’re analyzing. “Wide” version of table with one column per sample. Primary key is gene.

  3. eset_tidy_p: preprocessed expression data joined to phenotype data about the sample. “Tall” version of table. Primary key is (gene, sample).

  4. results_genes_tib: results of the limma differential expression analysis. Each row is a gene.
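To make the "wide" versus "tall" distinction concrete, here is a minimal sketch in base R. The gene symbols, sample IDs (GSM_A, GSM_B), intensities, and the cell_type column are all toy values invented for illustration, not data from this chapter, and the same reshaping could equally be done with tidyverse verbs.

```r
# Toy "wide" table: one row per gene, one column per sample.
# Sample IDs and intensities are hypothetical.
wide <- data.frame(
  gene  = c("CD3E", "CD4"),
  GSM_A = c(10.1, 8.2),
  GSM_B = c(9.7, 5.4),
  stringsAsFactors = FALSE
)

sample_cols <- setdiff(names(wide), "gene")

# "Tall" table: one row per (gene, sample) pair.
tall <- data.frame(
  gene      = rep(wide$gene, times = length(sample_cols)),
  sample    = rep(sample_cols, each = nrow(wide)),
  intensity = unlist(wide[sample_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical phenotype table, one row per sample.
pheno <- data.frame(
  sample    = c("GSM_A", "GSM_B"),
  cell_type = c("T cell", "B cell"),
  stringsAsFactors = FALSE
)

# Joining on sample attaches the phenotype to every (gene, sample) row.
tall_p <- merge(tall, pheno, by = "sample")
```

The wide form is convenient for matrix-based tools, while the tall form is easier to filter, group, and plot by phenotype.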

4.2 Managing downloads

Let’s put all of our downloaded data into a single directory.
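One way to set this up is sketched below. The directory names and the fetch_once() helper are hypothetical, and tempdir() is used only so the sketch does not touch a real project; substitute your own path.

```r
# Hypothetical layout: one parent directory with a subdirectory per source.
data_dir <- file.path(tempdir(), "data")
geo_dir  <- file.path(data_dir, "geo")

# recursive = TRUE creates missing parents; showWarnings = FALSE keeps
# reruns quiet when the directory already exists.
dir.create(geo_dir, recursive = TRUE, showWarnings = FALSE)

# Hypothetical helper: download a file only if it is not already cached.
fetch_once <- function(url, dest_dir) {
  dest <- file.path(dest_dir, basename(url))
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")
  }
  dest
}
```

Caching downloads this way means rerunning the chapter does not fetch the same files again.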

4.5 Load expression data

We then read the CEL files into an oligo ExpressionFeatureSet object. Note that the feature data (i.e. the probeset IDs) are not stored in the feature data of the object itself, but rather in a SQLite database, pd.hg.u133a, somewhere on disk that the call to rma() will pick up.

4.6 Preprocess expression data

Run RMA (robust multi-array average) on the batch of CEL files we want to compare.

4.7 Map probesets to genes

Because we want to do our differential expression analysis at the gene level, not the probeset level, we need to map probeset IDs to gene symbols. To ensure comparability with the GSEA gene sets, we’ll use the annotations provided by the Broad that are used in GSEA.

We need to clean up the data a bit: the import produced a bogus column (X4), and we only want one gene symbol per probeset ID.
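A minimal sketch of this cleanup in base R follows. The toy table, its column names (other than X4), and the rule for picking one symbol are assumptions: here we assume multiple symbols are separated by " /// " and keep the first, which may differ from the rule actually used in this chapter.

```r
# Toy stand-in for an imported chip file. Column names other than X4
# are hypothetical, as are the probeset IDs.
chip <- data.frame(
  probeset = c("p1", "p2", "p3"),
  symbol   = c("CD3E", "CD4 /// CD8A", "---"),
  title    = c("CD3e molecule", "ambiguous probeset", "---"),
  X4       = NA,
  stringsAsFactors = FALSE
)

# Drop the bogus column produced by the import.
chip$X4 <- NULL

# ASSUMPTION: keep only the first symbol when several are listed,
# by deleting everything from the first " ///" separator onward.
chip$symbol <- sub(" ///.*$", "", chip$symbol)
```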

Finally we can join with our processed data!

We now remove probesets that don’t map to a gene symbol. Note that chip files encode a missing gene symbol as ---, which we demonstrate before applying the filter.
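The join and the filter can be sketched on toy data as follows. Probeset IDs, sample IDs, and intensities are invented for illustration, and base R's merge() stands in for whichever join function the chapter uses.

```r
# Toy expression table: one row per probeset (hypothetical values).
expr <- data.frame(
  probeset = c("p1", "p2", "p3", "p4"),
  GSM_A    = c(10.1, 11.3, 8.2, 6.0),
  GSM_B    = c(9.7, 9.1, 5.4, 6.2),
  stringsAsFactors = FALSE
)

# Toy annotation table: "---" marks a probeset with no gene symbol.
annot <- data.frame(
  probeset = c("p1", "p2", "p3", "p4"),
  symbol   = c("CD3E", "CD3E", "CD4", "---"),
  stringsAsFactors = FALSE
)

# Join expression values to gene symbols on probeset ID.
joined <- merge(expr, annot, by = "probeset")

# Count how many probesets lack a symbol before filtering.
n_unmapped <- sum(joined$symbol == "---")

# Keep only probesets that map to a gene symbol.
mapped <- joined[joined$symbol != "---", ]
```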

Some genes have multiple probesets that map to them. GSEA handles this situation by keeping only the maximum intensity value across all probesets that map to the gene. In GSEA terminology, we are using the max probe algorithm to collapse our probesets at the gene level.
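A minimal base R sketch of the max probe collapse is shown below, using toy values. The maximum is taken separately within each sample column, so the collapsed row for a gene can combine values from different probesets.

```r
# Toy mapped table: two probesets for CD3E, one for CD4 (hypothetical values).
mapped <- data.frame(
  symbol = c("CD3E", "CD3E", "CD4"),
  GSM_A  = c(10.1, 11.3, 8.2),
  GSM_B  = c(9.7, 9.1, 5.4),
  stringsAsFactors = FALSE
)

# For each gene symbol, take the per-sample maximum across its probesets.
genes <- aggregate(. ~ symbol, data = mapped, FUN = max)
```

Here CD3E ends up with GSM_A taken from its second probeset and GSM_B from its first.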

4.9 Differential expression analysis

Now we’ll take our new genes-only expression data and put it into a form expected by limma.

One last step: we need to make the design matrix for this comparison.
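As an illustration, a two-group design matrix can be built with base R's model.matrix(). The sample IDs and the group labels (Tcell, other) are hypothetical and not necessarily the comparison made in this chapter.

```r
# Hypothetical phenotype table: two samples per group.
pheno <- data.frame(
  sample = c("GSM_A", "GSM_B", "GSM_C", "GSM_D"),
  group  = factor(c("Tcell", "Tcell", "other", "other"),
                  levels = c("other", "Tcell")),
  stringsAsFactors = FALSE
)

# "~ 0 + group" drops the intercept, giving one indicator column per group.
design <- model.matrix(~ 0 + group, data = pheno)

# Replace the default "groupother"/"groupTcell" names with the level names.
colnames(design) <- levels(pheno$group)
rownames(design) <- pheno$sample
```

With a no-intercept design like this, each column estimates a group mean, and the comparison of interest (for example Tcell minus other) is expressed as a contrast between columns when the model is fitted.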

Finally we use limma to perform a differential expression analysis!

Let’s look at our results.

Figure 4.2: Volcano plot made with biobroom
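For readers unfamiliar with volcano plots, the sketch below shows how the axes are derived: the x axis is the log fold change and the y axis is the negative log10 of the p-value. The results table is toy data, the cutoffs are arbitrary choices for illustration, and base R graphics stand in for the biobroom-based plot in the figure.

```r
# Toy results table (hypothetical genes and values).
results <- data.frame(
  gene    = c("geneA", "geneB", "geneC", "geneD"),
  logFC   = c(2.5, -1.8, 0.1, 0.4),
  P.Value = c(0.001, 0.01, 0.8, 0.2),
  stringsAsFactors = FALSE
)

# y axis of a volcano plot: small p-values become large positive numbers.
results$neg_log10_p <- -log10(results$P.Value)

# ASSUMPTION: illustrative cutoffs of |logFC| > 1 and p < 0.05.
results$highlight <- abs(results$logFC) > 1 & results$P.Value < 0.05

plot(
  results$logFC, results$neg_log10_p,
  col  = ifelse(results$highlight, "red", "grey40"),
  pch  = 19,
  xlab = "log2 fold change",
  ylab = "-log10(p-value)"
)
```

Genes with both a large effect and a small p-value land in the upper left and upper right corners of the plot.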

References

Godec, Jernej, Yan Tan, Arthur Liberzon, Pablo Tamayo, Sanchita Bhattacharya, Atul J Butte, Jill P Mesirov, and W Nicholas Haining. 2016. “Compendium of Immune Signatures Identifies Conserved and Species-Specific Biology in Response to Inflammation.” Immunity 44 (1): 194–206.