9.2 Identification of Marker Features

In addition to using prior knowledge of relevant marker genes for annotation of clusters, ArchR enables unbiased identification of marker features for any given cell groupings (for example, clusters). These features can be anything - peaks, genes (based on gene scores), or transcription factor motifs (based on chromVAR deviations). ArchR does this using the getMarkerFeatures() function which can take as input any matrix via the useMatrix parameter and it identifies features unique to the groups indicated by the groupBy parameter. If the useMatrix parameter is set to “GeneScoreMatrix”, then the function will identify the genes that appear to be uniquely active in each cell type. This provides an unbiased way of seeing which genes are predicted to be active in each cluster and can aid in cluster annotation.

As mentioned above, the same getMarkerFeatures() function can be used with any matrix stored in the Arrow files to identify features that are specific to certain cell groups. This is accomplished via the useMatrix parameter. For example, useMatrix = "TileMatrix" would identify genomic regions that are highly specific to a certain cell group and useMatrix = "PeakMatrix" would identify peaks that are highly specific to a certain cell group. Examples of how to use the getMarkerFeatures() function on other feature types are provided in later chapters.

9.2.1 How does marker feature identification happen?

This process of marker feature identification hinges on the selection of a group of bias-matched background cells for each cell group. Across all features, each cell group is compared to its own background group of cells to determine if the given cell group has significantly higher accessibility.

The selection of these background cell groups is critical to the success of this process and is performed across the multi-dimensional space provided by the user via the bias argument to getMarkerFeatures(). For each cell in the cell group, ArchR finds the nearest neighbor cell across the provided multi-dimensional space that is not a member of the given cell group and adds it to the background group of cells. In this way, ArchR creates a group of bias-matched cells that is as similar as possible to the given cell group, thus enabling a more robust determination of significance even if the group of cells is small.

The way ArchR does this is by taking all of the dimensions provided via the bias parameter and quantile normalizing their values to distribute the variance of each dimension across the same relative scale. Taking a toy example, if the parameters TSS and log10(Num Fragments) were provided to bias, the pre-quantile normalized values might look like this:

Here, the relative variance across the y-axis is very small compared to the variance across the x-axis. If we normalize these axes so that their values range from 0 to 1 instead, we make the relative variance much more equal. Importantly, we also change the nearest neighbors dramatically as indicated to the right of this plot.

ArchR normalizes all of the dimensions and uses euclidean distance in this normalized multidimensional space to find the nearest neighbors.