This function will compute an iterative LSI dimensionality reduction on an ArchRProject.

addIterativeLSI(
  ArchRProj = NULL,
  useMatrix = "TileMatrix",
  name = "IterativeLSI",
  iterations = 2,
  clusterParams = list(resolution = c(2), sampleCells = 10000, maxClusters = 6,
    n.start = 10),
  firstSelection = "top",
  depthCol = "nFrags",
  varFeatures = 25000,
  dimsToUse = 1:30,
  LSIMethod = 2,
  scaleDims = TRUE,
  corCutOff = 0.75,
  binarize = TRUE,
  outlierQuantiles = c(0.02, 0.98),
  filterBias = TRUE,
  sampleCellsPre = 10000,
  projectCellsPre = FALSE,
  sampleCellsFinal = NULL,
  selectionMethod = "var",
  scaleTo = 10000,
  totalFeatures = 5e+05,
  filterQuantile = 0.995,
  excludeChr = c(),
  saveIterations = TRUE,
  UMAPParams = list(n_neighbors = 40, min_dist = 0.4, metric = "cosine",
    verbose = FALSE, fast_sgd = TRUE),
  nPlot = 10000,
  outDir = getOutputDirectory(ArchRProj),
  threads = getArchRThreads(),
  seed = 1,
  verbose = TRUE,
  force = FALSE,
  logFile = createLogFile("addIterativeLSI")
)
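
A minimal usage sketch follows. It assumes that the ArchR package is installed and that `proj` is an existing ArchRProject that already contains a TileMatrix; both are assumptions of this example, not something this page guarantees. The result is assigned back to `proj` on the assumption that the function returns the updated project.

```r
# Arguments for the call; the values mirror the defaults listed in the usage above.
lsiArgs <- list(
  useMatrix = "TileMatrix",
  name = "IterativeLSI",
  iterations = 2,
  clusterParams = list(resolution = c(2), sampleCells = 10000,
                       maxClusters = 6, n.start = 10),
  varFeatures = 25000,
  dimsToUse = 1:30,
  seed = 1
)

# Assumption: `proj` is an existing ArchRProject. The guard keeps this snippet
# harmless when ArchR or `proj` is not available in the current session.
if (requireNamespace("ArchR", quietly = TRUE) && exists("proj")) {
  proj <- do.call(ArchR::addIterativeLSI, c(list(ArchRProj = proj), lsiArgs))
}
```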

Arguments

ArchRProj

An ArchRProject object.

useMatrix

The name of the data matrix to retrieve from the ArrowFiles associated with the ArchRProject. Valid options are "TileMatrix" or "PeakMatrix".

name

The name to use for storage of the IterativeLSI dimensionality reduction in the ArchRProject as a reducedDims object.

iterations

The number of LSI iterations to perform.

clusterParams

A list of additional parameters to be passed to addClusters() for clustering within each iteration. These parameters can be constant across iterations or specified for each iteration individually, so each parameter must have length 1 or length equal to the total number of iterations minus 1. PLEASE NOTE: the defaults have been updated to resolution=2 and maxClusters=6. To use the previous settings, set resolution=0.2 and maxClusters=NULL.
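
The length rule for clusterParams can be illustrated with a small check. `checkClusterParams` is a hypothetical helper written for this example and is not part of ArchR.

```r
# Hypothetical helper (not part of ArchR): TRUE when every parameter has
# length 1 or length iterations - 1, as the description above requires.
checkClusterParams <- function(clusterParams, iterations) {
  lens <- lengths(clusterParams)
  all(lens == 1 | lens == iterations - 1)
}

# With 3 iterations there are 2 clustering rounds, so a parameter may carry
# either one shared value or two per-round values.
params <- list(resolution = c(0.5, 2), sampleCells = 10000,
               maxClusters = 6, n.start = 10)
ok3 <- checkClusterParams(params, iterations = 3)
ok4 <- checkClusterParams(params, iterations = 4)
```

Here `resolution` has two values, which fits 3 iterations but not 4.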

firstSelection

The feature selection method to use in the first LSI iteration. Either "Top" for the top accessible/average features or "Var" for the top variable features. "Top" should be used for all scATAC-seq data (binary) while "Var" should be used for all scRNA/other-seq data types (non-binary).

depthCol

A column in the ArchRProject that represents the coverage (scATAC = unique fragments, scRNA = unique molecular identifiers) per cell. These values are used to minimize depth-related biases in the reduction. For scATAC we recommend "nFrags" and for scRNA we recommend "Gex_nUMI".

varFeatures

The number (N) of variable features to use for LSI. The top N features will be used based on the selectionMethod.

dimsToUse

A vector containing the dimensions from the reducedDims object to use in clustering.

LSIMethod

A number or string indicating the order of operations in the TF-IDF normalization. Possible values are: 1 or "tf-logidf", 2 or "log(tf-idf)", and 3 or "logtf-logidf".
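
The three orderings can be sketched in base R on a toy matrix. This is an illustration of how the names map to an order of operations, not ArchR's implementation: the `1 +` offsets inside the logarithms and the use of `scaleTo` in method 2 are assumptions of this sketch.

```r
# Toy features-by-cells count matrix (3 features, 4 cells); values are illustrative.
counts <- matrix(c(1, 0, 2,
                   0, 1, 1,
                   3, 1, 0,
                   1, 1, 1), nrow = 3)

tf  <- sweep(counts, 2, colSums(counts), "/")  # term frequency: each cell sums to 1
idf <- ncol(counts) / rowSums(counts)          # inverse document frequency per feature
scaleTo <- 10000                               # assumed scaling constant for method 2

# idf has one entry per row, so `tf * idf` multiplies each feature (row) by its idf.
tfidf1 <- tf * log(1 + idf)               # 1 or "tf-logidf"
tfidf2 <- log(tf * idf * scaleTo + 1)     # 2 or "log(tf-idf)"
tfidf3 <- log(tf + 1) * log(1 + idf)      # 3 or "logtf-logidf"
```

In all three variants a zero count stays zero, so sparsity is preserved.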

scaleDims

A boolean that indicates whether to z-score the reduced dimensions for each cell. This is useful for minimizing the contribution of strong biases (dominating early PCs) and lowly abundant populations. However, this may lead to stronger sample-specific biases since it over-weights latent PCs. If set to NULL, the dimensions are scaled based on the value of scaleDims used when the reducedDims were originally created during dimensionality reduction. This idea was introduced by Timothy Stuart.
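
A minimal sketch of this scaling, reading "for each cell" as standardizing the values across dimensions within each cell (row). That reading, and the toy matrix, are assumptions of this example.

```r
set.seed(1)
# Toy reduced-dimensions matrix: 5 cells (rows) by 4 LSI dimensions (columns).
lsi <- matrix(rnorm(5 * 4), nrow = 5,
              dimnames = list(paste0("cell", 1:5), paste0("LSI", 1:4)))

# scale() standardizes columns, so transpose to z-score each cell, then transpose back.
lsiScaled <- t(scale(t(lsi)))
```

After scaling, every cell has mean 0 and standard deviation 1 across its dimensions, so no single dimension can dominate purely through its magnitude.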

corCutOff

A numeric cutoff for the correlation of each dimension to the sequencing depth. If the dimension has a correlation to sequencing depth that is greater than the corCutOff, it will be excluded from analysis.
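
The depth-correlation filter can be sketched as follows. The toy values are illustrative, and taking the absolute value of the correlation is an assumption of this sketch; the description above only mentions correlations greater than the cutoff.

```r
# Toy per-cell sequencing depth and a cells-by-dimensions matrix.
depth <- c(1000, 2000, 3000, 4000, 5000, 6000)
lsi <- cbind(
  LSI1 = log10(depth),                        # tracks depth almost perfectly
  LSI2 = c(0.3, -1.2, 0.8, -0.4, 1.1, -0.6),  # largely unrelated to depth
  LSI3 = c(-0.9, 0.5, 0.2, -0.7, 0.1, 0.8)    # moderately related to depth
)

corCutOff <- 0.75
corToDepth <- apply(lsi, 2, function(d) cor(d, depth))

# Assumption: use the absolute correlation so strongly negative dimensions are dropped too.
dimsKept <- colnames(lsi)[abs(corToDepth) <= corCutOff]
```

Only LSI1 exceeds the cutoff here, so it is the only dimension excluded.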

binarize

A boolean value indicating whether the matrix should be binarized before running LSI. This is often desired when working with insertion counts.

outlierQuantiles

Two numerical values (between 0 and 1) that describe the lower and upper quantiles of bias (number of accessible regions per cell, determined by nFrags or colSums) used to filter cells prior to LSI. For example, a value of c(0.02, 0.98) results in the cells in the bottom 2 percent and the top 2 percent (above the 98th percentile) being filtered prior to LSI. These cells are then projected back into the LSI subspace. This prevents spurious 'islands' that are identified based on being extremely biased. These quantiles are also used for sub-sampled LSI when determining which cells are used.
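
The quantile filter can be sketched with base R's `quantile()`. The toy depths are illustrative, and applying the quantiles to raw nFrags rather than a transformed value is an assumption of this example.

```r
# Toy per-cell fragment counts with one very shallow and one very deep cell.
nFrags <- c(500, 1200, 1500, 1800, 2000, 2200, 2500, 3000, 4000, 50000)
outlierQuantiles <- c(0.02, 0.98)

bounds <- quantile(nFrags, probs = outlierQuantiles)

# Cells inside the bounds are used to fit LSI; the rest would be projected afterwards.
keep <- nFrags >= bounds[1] & nFrags <= bounds[2]
cellsForLSI <- which(keep)
cellsToProject <- which(!keep)
```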

filterBias

A boolean indicating whether to drop bias clusters when computing clusters during iterativeLSI.

sampleCellsPre

An integer specifying the number of cells to sample in iterations prior to the last in order to perform a sub-sampled LSI and sub-sampled clustering. This greatly reduces memory usage and increases speed for early iterations.

projectCellsPre

A boolean indicating whether to reproject all cells into the sub-sampled LSI (see sampleCellsPre). Setting this to FALSE allows for using the sub-sampled LSI for clustering and variance identification. If TRUE the cells are all projected into the sub-sampled LSI and used for cluster and variance identification.

sampleCellsFinal

An integer specifying the number of cells to sample in order to perform a sub-sampled LSI in the final iteration.

selectionMethod

The selection method to be used for identifying the top variable features. Valid options are "var" for log-variability or "vmr" for variance-to-mean ratio.
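
The difference between the two options can be illustrated on a toy matrix. This is a sketch under assumptions: "log-variability" is taken here to mean the variance of log2-transformed values, and the features-by-groups layout is hypothetical.

```r
# Toy features-by-groups matrix of normalized accessibility (illustrative values).
acc <- matrix(c(10, 12, 11,
                 1, 20,  5,
                 0,  0,  9), nrow = 3, byrow = TRUE,
              dimnames = list(paste0("feature", 1:3), paste0("group", 1:3)))

logVar <- apply(log2(acc + 1), 1, var)        # "var": variance on a log scale (assumed form)
vmr    <- apply(acc, 1, var) / rowMeans(acc)  # "vmr": variance-to-mean ratio

# Keep the top N features under each criterion.
varFeatures <- 2
topByVar <- names(sort(logVar, decreasing = TRUE))[seq_len(varFeatures)]
topByVmr <- names(sort(vmr, decreasing = TRUE))[seq_len(varFeatures)]
```

The two criteria can rank the same features differently, which is why the choice of selectionMethod affects which features enter LSI.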

scaleTo

Each column in the matrix designated by useMatrix will be normalized to a column sum designated by scaleTo prior to variance calculation and TF-IDF normalization.

totalFeatures

The number of features to consider for use in LSI after ranking the features by the total number of insertions. These features are the only ones used throughout the variance identification and LSI. When using a TileMatrix, these features serve as the equivalent of a defined peakSet.

filterQuantile

A number between 0 and 1 that indicates the quantile above which features should be removed based on insertion counts prior to the first iteration of the iterative LSI paradigm. For example, if filterQuantile = 0.99, any features above the 99th percentile in insertion counts will be ignored for the first LSI iteration.

excludeChr

A character vector of chromosomes to exclude from the iterativeLSI procedure.
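
The filterQuantile and excludeChr filters can be sketched together on a toy feature table. The column names and values are hypothetical, and the order in which the two filters are applied is an assumption of this example.

```r
# Toy feature table: one row per tile or peak, with its total insertion count.
features <- data.frame(
  seqnames   = c("chr1", "chr1", "chr2", "chr2", "chrX",
                 "chrM", "chr3", "chr3", "chr4", "chr5"),
  insertions = c(10, 5000, 40, 15, 30, 25, 20, 35, 45, 50)
)

filterQuantile <- 0.9
excludeChr <- c("chrM")

# Features above this insertion count are ignored in the first iteration.
cutoff <- quantile(features$insertions, probs = filterQuantile)

keep <- features$insertions <= cutoff & !(features$seqnames %in% excludeChr)
featuresKept <- features[keep, ]
```

Here the second chr1 feature is removed by the quantile cutoff and the chrM feature is removed by excludeChr.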

saveIterations

A boolean value indicating whether the results of each LSI iteration should be saved as compressed .rds files in the designated outDir.

UMAPParams

The list of parameters to pass to the UMAP function if saveIterations=TRUE. See the function uwot::umap().

nPlot

If saveIterations=TRUE, the number of cells to sample to make a UMAP and plot for each iteration.

outDir

The output directory for saving LSI iterations if desired. The default is the outputDirectory of the ArchRProject.

threads

The number of threads to be used for parallel computing.

seed

A number to be used as the seed for random number generation. It is recommended to keep track of the seed used so that you can reproduce results downstream.

verbose

A boolean value that determines whether standard output includes verbose sections.

force

A boolean value that indicates whether or not to overwrite relevant data in the ArchRProject object.

logFile

The path to a file to be used for logging ArchR output.