Chapter 3 Flow
# Publication-related
library(kableExtra)
options(kableExtra.html.bsTable = T)
# Bioconductor
library(FlowRepositoryR)
library(flowCore)
library(openCyto)
library(ggcyto)
library(FlowSOM)
# Tidyverse
library(fs)
library(janitor)
library(tidyverse)
library(reshape2)
# General purpose
library(curl)
# Visualization
library(umap)
3.1 Overview
Flow Cytometry is a common technique employed in the field of immunology that quantifies characteristics of individual cells. Early flow cytometer designs date back to the first of half of the 20th century and while it wasn’t until the late 1960’s that the predecessors of modern fluorescence-based flow cytometers were first patented, the basic principles of these instruments have not changed in the intervening 50 years until today (Givan 2010). They operate by passing single cells through a sheath fluid that arranges them in a single file order and then into an interrogation point (often at a rate of thousands of cells per second or more) where visible light from lasers excite antibody-bound fluorophores or fluorescent dyes specific to chemicals, usually proteins, on or within each cell.
The data collected from a flow cytometer for each cell includes measurements for forward scatter (FSC), side scatter (SSC), and fluorescent marker intensities. The first two of these, FSC and SSC, are briefly explained in 3.1 (Rowley 2012) and measure morphological properties of cells while fluorescent intensities are the focus of the remaining analysis in this chapter. This analysis will also focus on several other key concerns in working with flow cytometry data like compensation, transformation, and automation through the use of several related Bioconductor packages. Most of these packages make use of flowCore (Hahne et al. 2009) and while some of the details of its utilization will be explained, it’s probably worth giving the vignette a quick browse before going further since some familiarity of FCS files, flowFrame
constructs, and GatingHierarchy
semantics will be assumed (though much of this is pretty intuitive given working R knowledge).
For more references on flow cytometry analysis and applications, see:
- Practical Flow Cytometry (Shapiro 1994): An in-depth text with details on advanced topics such as cell sorting, single molecule detection, and applications of flow in hematology, immunology, microbiology, food science, bioterrorism, and more
- Translational Applications of Flow Cytometry in Clinical Practice (Jaye et al. 2012): An overview of clinical applications in cancer and immunological disease diagnosis as well as cell therapy
- Critical assessment of automated flow cytometry data analysis techniques (Aghaeepour et al. 2013): A presentation of results from an analytical competition (FlowCAP) held to assess the performance of algorithmic advancements in flow cytometry data processing
3.2 Exploring FlowRepository
While many sources of public flow cytometry data exist, the subject of this chapter will be experiments uploaded to FlowRepository. Specifically, the FlowRepositoryR package will be used to download two datasets related to “Optimized Multicolor Immunophenotyping Panels” (OMIP) publications. These panels consist of dyes and antibodies used to label samples relevant for one specific experimental purpose such as characterizing natural killer cell populations or chemokine expression on helper T cells.
Data available from these publications may be found in FlowRepository simply by searching for studies containing the keywork “OMIP”. Using the FlowRepository site is generally the simplest way to find relevant studies but programmatic access can also be helpful for doing more targeted searches such as for studies that have an associated FlowJo workspace, as shown below.
# Search for all studies matching the OMIP keyword (which returns a list of string IDs) and then
# fetch all corresponding details for the first 25
studies <- flowRep.search('OMIP') %>% sort %>% head(25) %>% map(flowRep.get)
# Study objects are implemented as S4 classes with id and name slots as well as an attachments slot,
# which is a list further containing S4 objects (with a "name" slot having values like "20151210_Workspace.jo")
tibble(
id=studies %>% map_chr(~ .@id),
name=studies %>% map_chr(~ .@name),
panel=name %>% str_extract('(?<=OMIP-)\\d+') %>% as.integer,
has_flowjo_ws=studies %>% map_lgl(~ map(.@attachments, ~.@name) %>% str_detect('\\.jo|\\.wsp') %>% any)
) %>%
filter(has_flowjo_ws) %>%
select(-has_flowjo_ws) %>%
arrange(panel) %>%
kable
id | name | panel |
---|---|---|
FR-FCM-ZZ2T | OMIP-016: Characterization of Antigen-Responsive macaque and human T-cells (Figure 1) | 16 |
FR-FCM-ZZ2V | OMIP-016: Characterization of Antigen-Responsive macaque and human T-cells (Suppl Figure 9) | 16 |
FR-FCM-ZZ3Y | OMIP-016: Characterization of Antigen-Responsive macaque and human T-cells (Online Fig 7&8) | 16 |
FR-FCM-ZZ3Z | OMIP-016: Characterization of Antigen-Responsive macaque and human T-cells (Online Figure 6) | 16 |
FR-FCM-ZZ36 | OMIP-018: Phenotyping with Chemokine Receptors | 18 |
FR-FCM-ZZ74 | OMIP-023: 10-Color, 13 antibody panel for in-depth phenotyping of human peripheral blood leukocytes | 23 |
FR-FCM-ZZK4 | OMIP-028: Rhesus macaque NK cell subset | 28 |
FR-FCM-ZZWU | OMIP-030: Characterization of eight major human CD4+ T helper subsets, including Tregs, via surface markers | 30 |
FR-FCM-ZZX9 | OMIP-035: Nonhuman Primate NK cell functional assay | 35 |
FR-FCM-ZYY6 | OMIP-039: Detection and analysis of human adaptive NKG2C+ natural killer cells | 39 |
FR-FCM-ZYBP | OMIP-043: Identification of Human Antibody Secreting subsets | 43 |
3.3 Tools
TODO:
- Intro openCyto and flowWorkspace
- Explain how R tools are most appropriate for automated workflows (not manual gating)
- Explain single-step automation vs whole workflow automation
- Describe FlowJo and Cytobank
3.4 Simple Analysis Automation (OMIP-021)
Initial Example: OMIP-021: Innate-like T-cell Panel (FR-FCM-ZZ9H)
data_dir <- dir_create("data/flow_files/OMIP-021")
dataset_id <- "FR-FCM-ZZ9H"
dataset_path <- fs::path(data_dir, dataset_id)
dataset <- flowRep.get(dataset_id)
if (!dir_exists(dataset_path))
dataset <- download(dataset, dirpath=dataset_path, only.files="Donor.*fcs", show.progress=F)
fr <- read.FCS(fs::path(getwd(), dataset_path, 'Donor1.fcs'))
fr %>% exprs %>% as.tibble %>% head(10) %>%
kable(format.args = list(digits=3)) %>%
kable_styling() %>%
scroll_box(width="100%")
FSC-A | FSC-H | SSC-A | SSC-H | 450/50Violet-A | 525/50Violet-A | 540/30Violet-A | 585/15Violet-A | 610/20Violet-A | 670/30Violet-A | 670/14Red-A | 730//45Red-A | 780/60Red-A | 530/30Blue-A | 710/50Blue-A | 582/15Yellow-A | 610/20Yellow-A | 670/30Yellow-A | 710/50Yellow-A | 780/60Yellow-A | Time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
119565 | 108322 | 82937 | 76142 | 1130 | 257 | 630 | 1002 | 3625 | 5465 | 7061 | 13215 | 6966 | 222.4 | 17786 | 803 | 418 | 20700 | 29263 | 19368 | 136 |
107004 | 77596 | 262143 | 207144 | 1223 | 732 | 995 | 169 | 721 | 114 | 1110 | 8055 | 2695 | 420.8 | 926 | 358 | 294 | 851 | 70563 | 2586 | 136 |
214908 | 139281 | 262143 | 226353 | 3218 | 1049 | 2240 | 1467 | 7903 | 10092 | 8914 | 26039 | 10930 | 561.1 | 19171 | 994 | 733 | 21668 | 88813 | 12031 | 136 |
119453 | 107496 | 67842 | 62072 | 2473 | 351 | 5828 | 14664 | 22003 | 3548 | 7316 | 6115 | 951 | 112.0 | 5504 | 184 | 448 | 4450 | 17505 | 1143 | 136 |
122345 | 111352 | 62728 | 56540 | 541 | 194 | 2601 | 735 | 1267 | 1095 | 3060 | 4466 | 4531 | 95.4 | 5793 | 435 | 440 | 8354 | 16969 | 3564 | 137 |
137030 | 124521 | 77133 | 70557 | 4062 | 471 | 10052 | 23228 | 41648 | 6317 | 13904 | 15326 | 2217 | 152.7 | 5801 | 189 | 829 | 10039 | 21626 | 2003 | 138 |
153801 | 122139 | 122174 | 80251 | 924 | 735 | 2110 | 1059 | 2827 | 2641 | 287 | 1277 | 611 | 4595.7 | 6182 | 175 | 201 | 822 | 33043 | 1301 | 138 |
177071 | 145150 | 168704 | 148426 | 731 | 679 | 1243 | 337 | 1045 | 201 | 415 | 9995 | 1659 | 291.3 | 1052 | 271 | 366 | 570 | 47515 | 953 | 138 |
165853 | 118245 | 122113 | 89253 | 2785 | 511 | 7996 | 2560 | 15636 | 9627 | 10654 | 29089 | 14198 | 337.8 | 20733 | 1203 | 2341 | 24269 | 40187 | 12859 | 138 |
121005 | 108937 | 72537 | 66157 | 255 | 205 | 3215 | 724 | 2316 | 153 | 361 | 829 | 141 | 166.0 | 4320 | 335 | 819 | 1027 | 21945 | 244 | 138 |
In this dataset, one way to identify lymphocytes is by looking at modes in the relationship between side and forward scatter:
To identify the lymphocyte cells above, openCyto
gating can be used to select the largest cluster of cells automatically and visualize what fraction of the population these cells constitute:
gate <- gate_flowClust_2d(fr, 'FSC-A', 'SSC-A', K=3, quantile=.95)
ggcyto(fr, aes(x='FSC-A', y='SSC-A')) + geom_hex(bins=100) + geom_gate(gate) + geom_stats()
3.4.1 Gating
By repeating the above process for all cell types of interest in the paper, the workflow can be reproduced via a GatingTemplate as shown below, which could also be built programmatically instead. The graphical representation of the template shows how cell types will be recursively defined based on 2 dimensional filters:
template <- 'alias,pop,parent,dims,gating_method,gating_args,collapseDataForGating,groupBy,preprocessing_method,preprocessing_args
Live,+,root,"FSC-A,525/50Violet-A","polyGate","x=c(0,3e5,3e5,1e5,.5e5,0),y=c(0,0,2.3,2.3,2,1.5)",,,,
Lymphocytes,+,Live,"FSC-A,SSC-A","flowClust.2d","K=3,quantile=.95",,,,
Singlets,+,Lymphocytes,"FSC-A,FSC-H","singletGate","maxit=1000,wider_gate=T,prediction_level=.999999999",,,,
gdTCells,+,Singlets,"670/30Violet-A,530/30Blue.A","flowClust.2d","K=3,target=c(2.5,2.5)",,,,
abTCells,+,Singlets,"670/30Violet-A,530/30Blue.A","flowClust.2d","K=3,target=c(2.5,1),quantile=0.95",,,,
maiTCells,+,abTCells,"582/15Yellow.A,540/30Violet.A","polyGate","x=c(2.7,5,5,2.7),y=c(2.5,2.5,5,5)",,,,'
template_path <- file_temp(ext='.csv')
write_lines(template, template_path)
gt <- gatingTemplate(template_path)
plot(gt)
The above template only outlines the gating workflow but to apply it to our data, these are the common steps:
- Apply compensation to the raw FCS data if necessary (not necessary in this case as the authors did this beforehand)
- Define channel transformations to make gating and visualization possible
- Define any custom gating functions needed in the workflow, which is particularly useful for setting manual gates
- Apply the gating template to the data at hand, which in this case is represented as a
flowFrame
but could also be aflowSet
representing a collection of experiments
Here is the realization of these steps for this data specifically:
# Define logical transformation for all fluorescent channels and build a "GatingSet", which is a wrapper
# class that binds numeric data with transformations and gating information
transformer <- transformerList(colnames(fr@description$SPILL), logicle_trans())
gs <- transform(GatingSet(flowSet(fr)), transformer)
# Define a custom polygon gating function, that is used in our template to deal with situations
# that are difficult to define an automated gate for
.polyGate <- function(fr, pp_res, channels, filterId="polygate", ...){
args <- list(...)
g <- data.frame(x=args$x, y=args$y)
colnames(g) <- channels
flowCore::polygonGate(.gate=g, filterId=filterId)
}
registerPlugins(fun=.polyGate, methodName='polyGate', dep=NA)
# Apply the gating to the data (this may take a couple minutes)
gating(gt, gs)
3.4.2 Results
Now that the workflow is finished, here is a look at all cell types identified:
This should then be comparable to what was in the OMIP-021 publication, and here are relevant figures demonstrating the similarity of the cell subsets captured:
3.5 Advanced Analysis Automation (OMIP-030)
Advanced Example: OMIP-030: Characterization of eight major human CD4+ T helper subsets, including Tregs, via surface markers (FR-FCM-ZZWU)
3.5.1 Gating
Load a flow workspace, which will contain all gaiting and cell expression information:
data_dir <- dir_create('data/flow_files/OMIP-030')
zip_url <- 'https://storage.googleapis.com/t-cell-data/flow_files/OMIP-030.zip'
zip_file <- fs::path(data_dir, 'OMIP-030.zip')
if (!file_exists(zip_file)) {
curl_download(zip_url, zip_file) %>% unzip(exdir=data_dir)
}
ws <- openWorkspace(fs::path(data_dir, 'OMIP-030.wsp'))
gs <- parseWorkspace(ws, name='TCells', path=data_dir, isNcdf=FALSE)
# Extract single GatingHierarchy from GatingSet
gh <- gs[[1]]
Show the correspondence between the flow cytometer channels and the marker names in the panel.
name | desc | |
---|---|---|
$P1 | FSC_A | NA |
$P2 | FSC_W | NA |
$P3 | SSC_A | NA |
$P4 | SSC_W | NA |
$P5 | Comp-LIVE_DEAD_BLUE_A | L+D |
$P6 | Comp-BRILLIANT_VIOLET_421_A | CD38 |
$P7 | Comp-V500_A | CD4 |
$P8 | Comp-BV_570_A | CD45RA |
$P9 | Comp-QDOT_605_A | CD3 |
$P10 | Comp-QDOT_655_A | CD8 |
$P11 | Comp-BV_711_A | CD25 |
$P12 | Comp-BV_785_A | CD196 |
$P13 | Comp-FITC_A | CD183 |
$P14 | Comp-PER_CP_E_FLUOR_710_A | CD161 |
$P15 | Comp-PE_A | CD194 |
$P16 | Comp-PE_TEXAS_RED_A | CD197 |
$P17 | Comp-PE_CY5_5_A | CD20 |
$P18 | Comp-PE_CY7_A | CD185-bio |
$P19 | Comp-APC_A | CCR10 |
$P20 | Comp-ALEXA_FLUOR_700_A | Ki67 |
$P21 | Comp-APC_CY7_A | CD127 |
$P22 | TIME | NA |
Plot the gating workflow starting from where T Cells are first identified:
TODO: Explain gating via differentiation model
3.5.2 Extracting Single Cell Data
Identify terminal nodes (i.e. cell populations) in the workflow which were, by convention, prefixed with either CD4+
or CD8+
and then use those nodes to extract single-cell data into a data frame (useful for custom visualizations).
TODO: make display of pop names better (and make states a map from type?)
# Extract all terminal node names into a vector
cell_pops <- getNodes(gs, path=1) %>% keep(str_detect(., '^CD[4|8]\\+ .*'))
# Functions used to parse properties out of node cell population
cell_pop_to_class <- function(x) case_when(
str_detect(x, '^CD4\\+') ~ 'CD4+',
str_detect(x, '^CD8\\+') ~ 'CD8+',
TRUE ~ NA_character_
)
cell_pop_to_mem_state <- function(x) case_when(
str_detect(x, '^CD[4|8]\\+ CM') ~ 'CM',
str_detect(x, '^CD[4|8]\\+ EM') ~ 'EM',
str_detect(x, '^CD[4|8]\\+ Naive$') ~ 'Naive',
str_detect(x, '^CD[4|8]\\+ Effector$') ~ 'EMRA',
TRUE ~ NA_character_
)
# Create vectors of all distinct cell states and types
cell_states <- cell_pops[!is.na(cell_pop_to_mem_state(cell_pops))]
cell_types <- discard(cell_pops, ~. %in% cell_states)
tibble(
population=cell_pops,
state=cell_pops %>% cell_pop_to_mem_state,
class=cell_pops %>% cell_pop_to_class
) %>% kable
population | state | class |
---|---|---|
CD4+ CM | CM | CD4+ |
CD4+ EM | EM | CD4+ |
CD4+ FH Th1-like | NA | CD4+ |
CD4+ FH Th2-like | NA | CD4+ |
CD4+ FH Th17-like | NA | CD4+ |
CD4+ ThG | NA | CD4+ |
CD4+ Th2g1 | NA | CD4+ |
CD4+ Th2g2 | NA | CD4+ |
CD4+ Th9 | NA | CD4+ |
CD4+ Th17 | NA | CD4+ |
CD4+ Th22 | NA | CD4+ |
CD4+ Effector | EMRA | CD4+ |
CD4+ Naive | Naive | CD4+ |
CD4+ Memory Treg | NA | CD4+ |
CD4+ Naive Treg | NA | CD4+ |
CD8+ mNKT-MAIT | NA | CD8+ |
CD8+ CM | CM | CD8+ |
CD8+ EM | EM | CD8+ |
CD8+ Effector | EMRA | CD8+ |
CD8+ Naive | Naive | CD8+ |
Next, extract data from the flowFrame
using the getIndices
function which, for a given list of m
node names, returns an n
x m
matrix where n
is the number of cells and m
logical columns contain TRUE/FALSE
values indicating whether or not the cell (row) was in that population.
extract_node_assignment <- function(nodes) {
nodes %>% map(~getIndices(gh, .)) %>% bind_cols %>%
set_names(nodes) %>% apply(., 1, function(x){
if (sum(x) == 1) nodes[x]
else NA_character_
})
}
# Separately query the single cells to determine state and type (as well as a few other metadata properties)
df_cell_meta <- list(type=cell_types, state=cell_states) %>%
map(extract_node_assignment) %>% as_tibble %>%
mutate(type=coalesce(type, state)) %>%
mutate(state=replace_na(cell_pop_to_mem_state(state), 'None')) %>%
mutate(class=cell_pop_to_class(type)) %>%
# Collapse these incorrect node names (there is a typo in the OMIP-030 CD4 gating figure)
# TODO: fix in original flowjo workspace
mutate(type=case_when(
type == 'CD4+ Th2g1' ~ 'CD4+ Th2',
type == 'CD4+ Th2g2' ~ 'CD4+ Th1',
TRUE ~ type
)) %>%
mutate(label=str_trim(str_replace(type, 'CD[4|8]\\+ ', '')))
# Concatenate cell meta data with expression data and remove any records for T cells
# with no known type (i.e. doublets, dead cells, B cells, CD4/CD8 double positive, etc.)
fr <- getData(gh)
name_map <- parameters(fr)@data %>% filter(str_detect(name, 'Comp'))
name_map <- set_names(name_map$name, name_map$desc)
df <- fr %>% exprs %>% as_tibble %>%
select(starts_with('Comp')) %>%
rename(!!name_map) %>%
cbind(df_cell_meta) %>%
filter(!is.na(type))
df %>% select(type, state, class, label, CD3, CD4, CD8) %>%
head(10) %>% kable
type | state | class | label | CD3 | CD4 | CD8 |
---|---|---|---|---|---|---|
CD8+ Naive | Naive | CD8+ | Naive | 145.1010 | 74.11814 | 171.20775 |
CD8+ Naive | Naive | CD8+ | Naive | 160.1456 | 55.52768 | 167.54243 |
CD8+ Naive | Naive | CD8+ | Naive | 150.9526 | 97.35362 | 170.99518 |
CD8+ Effector | EMRA | CD8+ | Effector | 128.7182 | 103.24777 | 171.34473 |
CD4+ Th1 | CM | CD4+ | Th1 | 163.2715 | 168.34281 | 52.61338 |
CD4+ Naive | Naive | CD4+ | Naive | 172.3580 | 171.73494 | 71.21534 |
CD4+ Th17 | EM | CD4+ | Th17 | 138.2115 | 159.26622 | 50.87339 |
CD8+ CM | CM | CD8+ | CM | 130.1044 | 110.77394 | 169.39673 |
CD4+ Naive | Naive | CD4+ | Naive | 153.2279 | 167.86806 | 51.67805 |
CD8+ CM | CM | CD8+ | CM | 154.5896 | 90.04521 | 187.45023 |
3.5.3 Preliminary Visualization
Plot the distribution of different cell types noting that the memory state exhibited by CD4+ and CD8+ cells is considered independent of the other primary phenotypes (helper, regulatory, innate-like, etc.). For example, this means that a Th17 cell can also exhibit CM or EM cell markers so in ?? this is shown as sums of constituent cell populations for each memory state.
df %>%
group_by(type, state) %>% tally %>% ungroup %>%
mutate(type=fct_reorder(type, n, sum), percent=n/sum(n)) %>%
ggplot(aes(x=type, y=percent, fill=state)) +
geom_bar(stat='identity') +
scale_fill_brewer(palette = 'Set1') +
scale_y_continuous(labels=scales::percent) +
labs(fill='Memory State', x='Primary Phenotype', y='Percentage', title='Cell Type Distribution') +
flow_theme + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Workflow gates for terminal cell populations overlaid on relevant expression distributions:
autoplot(gh, cell_pops[1:8], bins=50, strip.text='gate', axis_inverse_trans=FALSE) +
ggcyto_par_set(facet=facet_wrap(~name, scales='free'), limits = list(x=c(0, 225), y=c(0, 225))) +
flow_theme
Visualize expression distribution separation for each marker and cell type using the Resolution Metric (\(R_D\)) (Bushnell 2015) (Ortyn et al. 2006), which is used here to measure the difference in median (scaled by dispersion) between the expression of a marker for one cell type vs all others.
df_rd <- df %>% group_by(type) %>% do({
d <- .
dp <- df %>% filter(type != d$type[1]) %>% select_if(is.numeric)
dt <- d %>% select_if(is.numeric)
assertthat::are_equal(colnames(dp), colnames(dt))
stats <- lapply(colnames(dp), function(col) {
x <- pull(dt, col)
y <- pull(dp, col)
(median(x) - median(y)) / (mad(x) + mad(y))
})
names(stats) <- colnames(dp)
data.frame(stats)
}) %>% ungroup
# Cluster resolution metric vectors by cell type (for better visualization)
hc <- df_rd %>% select_if(is.numeric) %>%
as.matrix %>% dist %>% hclust
# Plot resolution metric with cell types on x axis and markers on y axis
df_rd %>% melt(id.vars='type') %>%
mutate(type=factor(type, levels=df_rd$type[hc$order])) %>%
ggplot(aes(x=type, y=variable, fill=value)) +
geom_tile(width=0.9, height=0.9) +
scale_fill_gradient2(
low='darkred', mid='white', high='darkblue',
guide=guide_colorbar(title=expression(R[D]))) +
flow_theme +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab('Phenotype') + ylab('Marker') +
ggtitle(expression("Resolution Metric (R"[D]*") by Cell Type and Marker"))
Show example distributions for individual cell type and marker pairs with low, near-zero, and high \(R_D\) values.
df_examples <- tribble(
~type, ~variable,
'CD4+ Memory Treg', 'CD25',
'CD4+ ThG ', 'CD161',
'CD8+ EM', 'CD197'
)
df_rd %>% melt(id.vars='type') %>%
mutate(variable=as.character(variable)) %>%
inner_join(df_examples) %>% rowwise() %>% do({
r <- .
df %>% select(one_of('type', r$variable)) %>% set_names(c('type', 'value')) %>%
mutate(group=case_when(type == r$type ~ 'Target Cell Type', TRUE ~ 'Other Cell Types')) %>%
# Downsample as more than 10k samples per-group becomes unnecessarily redundant
group_by(group) %>% do(sample_n(., min(nrow(.), 10000))) %>% ungroup %>%
mutate(rd=format(r$value, digits=3), marker=r$variable, type=r$type)
}) %>%
ggplot(aes(x=value, fill=group)) +
geom_density(alpha=.5) +
scale_fill_brewer(palette = 'Set1') +
labs(fill='Distribution', x='Marker Intensity', y='Density', title='Resolution Metric Examples') +
facet_wrap(~rd + marker + type, scales='free', labeller = label_bquote(.(type) * ":" ~ .(marker) ~ "(R"[D] ~ "=" ~ .(rd) * ")")) +
flow_theme
While univariate and bivariate measures of separation are useful for some cell populations, many are defined in the workflow based on combinations of several different positive and negative markers. For example, Th9 cells are defined as CCR10-
, CD185-bio-
, CD194-
and CD196+
, excluding markers to isolate T cells in the first place. This is definitely a type of cell that cannot be visualized easily based on one or two markers alone so instead, we will try a UMAP decomposition to project all fluorescent intensity measurements into a 2D space in such a way that preserves proximity between data points (cells in this case) on the manifold assumed to exist in the original high dimensional space. This manifold may not actually be two dimensional, which is often the case, but the 2D assumption will still provide results that give a sense of which cell types most readily differentiate from one another in this panel.
TODO: Compare to TSNE (runtimes and interpretation)
set.seed(1)
# Sample at most 300 cells of each type
df_samp <- df %>% group_by(type, state) %>% do(sample_n(., min(nrow(.), 300))) %>% ungroup
# Select all markers related to T cell differentation (i.e. exclude those for viability and isolation)
differentiation_markers <- df %>% select_if(is.numeric) %>%
select(-one_of('L+D', 'CD3', 'CD20')) %>% colnames
# Build UMAP configuration
umap_conf <- umap.defaults
umap_conf$min_dist <- .2
# Run decomposition and extract projection into 2D space with associated cell metadata
df_umap <- df_samp %>% select(one_of(differentiation_markers)) %>%
as.matrix %>% umap(config=umap_conf) %>% .$layout %>% as_tibble %>%
cbind(df_samp %>% select(type, state, class))
# Plot UMAP coordinates and cluster labels separately
ggplot(NULL) +
geom_point(data=df_umap, aes(x=V1, y=V2, color=type, shape=state), alpha=.3) +
geom_label(
data=df_umap %>% group_by(type) %>% summarize(cx=median(V1), cy=median(V2)) %>% ungroup,
aes(x=cx, y=cy, color=type, label=type), size=3, alpha=.9
) +
scale_alpha_continuous(guide='none') +
scale_color_discrete(guide='none') +
labs(
color='Primary Phenotype', shape='Memory State', x='', y='',
title=str_glue('OMIP-030 UMAP (N = {n})', n=nrow(df_umap))
) +
flow_theme
3.5.4 FlowSOM
The FlowSOM package implements an algorithm for automatic clustering and visualization of cytometry data based on self-organizing maps. Much like UMAP, this technique attempts to learn a low dimensional representation for large numbers of numerical measurements but unlike such methods used for visualization, this representation is not limited to 2 or 3 dimensions. The FlowSOM algorithm instead works by building a minimum spanning tree between clusters in the lower dimensional space, which again can have > 3 dimensions, before providing a variety of ways to interrogate the resulting graph structure. This structure also has the advantage of being able to be visualized easily in 2 dimensions through arrangement of the graph nodes into any plane that makes visualizing their connections possible.
There are a variety of similar algorithms such as SPADE, flowMeans, ACCENSE, PhenoGraph, and X-shift, but FlowSOM was chosen in this case because it offers similar or superior accuracy to all other methods while also executing much, much faster (Weber and Robinson 2016). For more information on FlowSOM, see the original publication or the Biconductor package vignette.
To use FlowSOM, the first step necessary is to decide which expression markers will be used for clustering. In this case, just as in the UMAP visualization, we want these markers to be specific to cell phenotypes but not viability or T cell isolation. FlowSOM is also built to operate well with flowCore data structures so the flowFrame containing the OMIP-030 data must also be filtered to the same cells studied thus far.
3.5.4.1 Execution
# Helpful function here will return a vector of indexes matching the given marker names
get_marker_indexes <- function(x) which(fr@parameters@data$desc %in% x)
# Create the vector of marker name indexes to use for clustering
mi_som <- get_marker_indexes(differentiation_markers)
# Filter the flowFrame to only cells with a known type
fr_som <- Subset(fr, !is.na(df_cell_meta$type))
stopifnot(nrow(fr_som) == nrow(df))
stopifnot(length(mi_som) == length(differentiation_markers))
# Show the chosen markers
differentiation_markers
[1] “CD38” “CD4” “CD45RA” “CD8” “CD25”
[6] “CD196” “CD183” “CD161” “CD194” “CD197”
[11] “CD185-bio” “CCR10” “Ki67” “CD127”
Run the FlowSOM algorithm.
set.seed(1)
# Prepare the flowFrame by scaling intensities (-mean/sd)
fr_som <- ReadInput(fr_som, scale=T)
# Use a 64 node (8x8) SOM instead of the default 100 node (10x10) SOM
# * 100 is too high for visualization of ~30 latent clusters
bi_som <- BuildSOM(fr_som, colsToUse=mi_som, xdim=8, ydim=8)
# Build the MST connecting the SOM nodes
mst_som <- BuildMST(bi_som, tSNE=FALSE)
3.5.4.2 Interrogation
One way to explore the resulting cluster graph is with individual nodes represented as “Star Charts”. These charts make it possible to see over/under expression for several markers in a cluster simultaneously. However, the 14 markers used for clustering make for a cluttered presentation so 6 key markers are chosen here to make it possible to see patterns in the clusters more clearly.
markers <- get_marker_indexes(c('CD4', 'CD8', 'CD25', 'CD197', 'CD45RA', 'CD185-bio'))
PlotStars(mst_som, view="MST", markers=markers, colorPalette=rainbow, legend=TRUE)
In the above figure, CD4
and CD8
over-expression roughly separate the graph into two halves (the red and green nodes). Within each of those partitions the CD197+
a.k.a. CCR7+
(blue) clusters are also evident and indicate more naive cell groups. Furthermore, within the CD4+
partition of the graph it is also clear that certain clusters have high expression levels of CD25
(cyan) as well as CD185-bio
(pink) indicating higher representation from regulatory and follicular helper cells, respectively.
Given no knowledge of the cell types a priori, a useful tool for investigating this representation of our dataset further is to simply specify positive/negative rule sets for specific cell types, based on a priori biological knowledge, and visualize which clusters best match those profiles. Two examples of this are shown below with profiles defined as “queries” of the SOM MST for both \(T_{reg}\) and \(T_{FH}\) cells.
TODO: Upload and reference supplemental OMIP-030 doc containing these profiles for each cell type
treg_query <- c(
"Comp-BV_711_A" = "high", # CD25
"Comp-V500_A" = "high", # CD4
"Comp-PE_CY7_A" = "low", # CD158-bio
"Comp-APC_CY7_A" = "low" # CD127
)
tfhc_query <- c(
"Comp-BV_711_A" = "low", # CD25
"Comp-V500_A" = "high", # CD4
"Comp-PE_CY7_A" = "high", # CD158-bio
"Comp-APC_CY7_A" = "high", # CD127
"Comp-APC_A" = "low", # CCR10
"Comp-BV_570_A" = "low" # CD45RA
)
treg_res <- QueryStarPlot(mst_som, treg_query, plot=FALSE)
tfhc_res <- QueryStarPlot(mst_som, tfhc_query, plot=FALSE)
bg_values <- factor(rep("Other", mst_som$map$nNodes), levels=c("Other", "Treg", "TFH"))
bg_values[treg_res$selected] <- 'Treg'
bg_values[tfhc_res$selected] <- 'TFH'
bg_colors <- c("#FFFFFF00", "#0000FF33", "#FF000033")
markers <- get_marker_indexes(c('CD4', 'CD8', 'CD25', 'CD185-bio', 'CCR10'))
PlotStars(
mst_som, view="MST", markers=markers, colorPalette=rainbow, legend=TRUE,
backgroundValues = bg_values, backgroundColor = bg_colors
)
The query results above show that \(T_{reg}\) and \(T_{FH}\) likely exist within this clustered representation and that the star chart profiles indicate expression patterns we would expect from those cell types. For a more detailed view of the distribution for a marker across the clusters it is also possible to color each one based on the values of a single marker. This makes small differences betweeen clusters more apparent and can help dissambiguate those that likely represent very similar cell types from one another.
3.5.5 Results
A more empirical validation of the FlowSOM results can also be done based on comparison to the manual gating results. Below, this is done by “Meta Clustering” the SOM nodes into some number of groups close to the number of different cell types determined through the manual process. FlowSOM makes visualizing the results of this possible through pie chart visualizations where for each node, the pie represents what portion of the manually gated cell types fall within it.
n_clusters <- df %>% group_by(type, state) %>% tally %>% filter(n > 1000) %>% nrow
meta_clustering <- metaClustering_consensus(mst_som$map$codes, k=n_clusters)
PlotPies(mst_som, df$type, view="MST", backgroundValues = as.factor(meta_clustering))
This figure demonstrates that certain nodes are dominated by one cell type (e.g. CD4+
naive and CD8+
naive) while others contain many different cell types in the manual results. It also shows that certain groups of an individual nodes can confidently be placed into less granular meta clusters (indiated by the background halos around the nodes) while for others the assignment is less convincingly related to a known cell profile.
The process for determining all of this with no manual gating information would usually involve deciding which known phenotype is most closely related to each cluster and/or meta-cluster so to continue further, this process will be emulated by electing the most common cell type from the manual gating result as the sole type for that inferred group. This then provides both a predicted and ground truth assignment for each cell which is aggregated into a comparison below showing how well the proportions of cells for each label align.
# meta_clustering = xdim*ydim length vector with meta cluster integers
# mst_som$map$mapping[,1] = Indexes of nodes (1 to xdim*ydim) for each of ~700k cells
df_cluster <- df %>% mutate(cluster=meta_clustering[mst_som$map$mapping[,1]]) %>%
group_by(cluster, type) %>% tally %>% ungroup %>%
group_by(cluster) %>% filter(n==max(n)) %>% ungroup %>% select(-n, cluster, predicted_type=type)
df_comparison <- df %>% mutate(cluster=meta_clustering[mst_som$map$mapping[,1]]) %>%
rename(actual_type=type) %>%
inner_join(df_cluster, by='cluster')
df_percents <- df_comparison %>% select(predicted=predicted_type, actual=actual_type) %>%
rowid_to_column('id') %>% melt(id.vars=c('id'), value.name='type') %>%
group_by(variable, type) %>% tally %>% ungroup %>%
group_by(variable) %>% mutate(percent=n/sum(n))
df_stats <- inner_join(
df_percents %>% select(-n) %>% spread(variable, percent) %>%
rename(predicted_percent=predicted, actual_percent=actual),
df_percents %>% select(-percent) %>% spread(variable, n) %>%
rename(predicted_n=predicted, actual_n=actual),
by='type'
)
fit <- lm(log10(actual_percent) ~ log10(predicted_percent), data=df_stats)
p_value <- format(summary(fit)$coef[2,4], digits=3)
rsq_value <- format(summary(fit)$r.squared, digits=3)
df_stats %>%
ggplot(aes(x=predicted_percent, y=actual_percent, color=type, size=actual_n)) +
geom_point() +
geom_abline(intercept=fit$coefficients[1], slope=fit$coefficients[2]) +
scale_x_log10(limits=c(.01, .3)) + scale_y_log10(limits=c(.01, .3)) +
scale_size_continuous(range=c(2, 5)) +
ggtitle(bquote(paste("Predicted vs Actual Cell Population Sizes ("*R^{2}~"="~.(rsq_value)*", p"~"="~.(p_value)*")" ))) +
flow_theme
References
Givan, Alice L. 2010. “Flow Cytometry: An Introduction.” In Methods in Molecular Biology, 1–29.
Rowley, Tom. 2012. “Flow Cytometry - a Survey and the Basics.” MATER METHODS 2 (August).
Hahne, Florian, Nolwenn LeMeur, Ryan R Brinkman, Byron Ellis, Perry Haaland, Deepayan Sarkar, Josef Spidlen, Errol Strain, and Robert Gentleman. 2009. “FlowCore: A Bioconductor Package for High Throughput Flow Cytometry.” BMC Bioinformatics 10 (April): 106.
Shapiro, Howard M. 1994. Practical Flow Cytometry. Wiley-Liss.
Jaye, David L, Robert A Bray, Howard M Gebel, Wayne A C Harris, and Edmund K Waller. 2012. “Translational Applications of Flow Cytometry in Clinical Practice.” J. Immunol. 188 (10): 4715–9.
Aghaeepour, Nima, Greg Finak, FlowCAP Consortium, DREAM Consortium, Holger Hoos, Tim R Mosmann, Ryan Brinkman, Raphael Gottardo, and Richard H Scheuermann. 2013. “Critical Assessment of Automated Flow Cytometry Data Analysis Techniques.” Nat. Methods 10 (3): 228–38.
Bushnell, Tim. 2015. Modern Flow Cytometry. ebook.
Ortyn, William E, Brian E Hall, Thaddeus C George, Keith Frost, David A Basiji, David J Perry, Cathleen A Zimmerman, David Coder, and Philip J Morrissey. 2006. “Sensitivity Measurement and Compensation in Spectral Imaging.” Cytometry A 69 (8): 852–62.
Weber, Lukas M, and Mark D Robinson. 2016. “Comparison of Clustering Methods for High-Dimensional Single-Cell Flow and Mass Cytometry Data.” Cytometry A 89 (12): 1084–96.