Add Doublet Scores to a collection of ArrowFiles or an ArchRProject

For each sample in the ArrowFiles or ArchRProject provided, this function will independently assign inferred doublet information to each cell. This allows for removing strong heterotypic doublet-based clusters downstream. A doublet results from a droplet that contained two cells, causing the ATAC-seq data to be a mixture of the signal from each cell.
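
The idea of a doublet as a mixture of two cells can be illustrated with a small base-R sketch. It is not ArchR's implementation: the toy matrix, the raw-count Euclidean distance, and the `hits`/`enrichment` names are assumptions made for illustration, and ArchR works in a reduced-dimension space (see `knnMethod`). The sketch simulates `nTrials * nCells` synthetic doublets by summing pairs of cells, then counts how often each real cell falls among the `k` nearest neighbors of a simulated doublet.

```r
set.seed(1)

# Toy accessibility matrix: 50 features (rows) x 20 cells (columns).
# All values are made up for illustration.
nFeatures <- 50
nCells <- 20
counts <- matrix(rpois(nFeatures * nCells, lambda = 1), nrow = nFeatures)

k <- 3        # neighbors of each simulated doublet to flag
nTrials <- 5  # nTrials * nCells doublets are simulated in total

hits <- integer(nCells)  # how often each real cell neighbors a simulated doublet
for (d in seq_len(nTrials * nCells)) {
  pair <- sample(nCells, 2)                          # two distinct parent cells
  doublet <- counts[, pair[1]] + counts[, pair[2]]   # mixture of both signals
  dists <- sqrt(colSums((counts - doublet)^2))       # distance to every real cell
  nearest <- order(dists)[seq_len(k)]
  hits[nearest] <- hits[nearest] + 1L
}

# Average number of hits per cell if neighbors were spread evenly.
expected <- nTrials * k
enrichment <- hits / expected
```

In this toy setup, cells with `enrichment` well above 1 sit close to many simulated doublets and would be the candidates to inspect or remove.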

addDoubletScores(
  input = NULL,
  useMatrix = "TileMatrix",
  k = 10,
  nTrials = 5,
  dimsToUse = 1:30,
  LSIMethod = 1,
  scaleDims = FALSE,
  corCutOff = 0.75,
  knnMethod = "UMAP",
  UMAPParams = list(n_neighbors = 40, min_dist = 0.4, metric = "euclidean", verbose = FALSE),
  LSIParams = list(outlierQuantiles = NULL, filterBias = FALSE),
  outDir = getOutputDirectory(input),
  threads = getArchRThreads(),
  force = FALSE,
  parallelParam = NULL,
  verbose = TRUE,
  logFile = createLogFile("addDoubletScores")
)

Arguments

input: An ArchRProject object or a character vector containing the paths to the ArrowFiles to be used.
useMatrix: The name of the matrix to be used for performing doublet identification analyses. Options include "TileMatrix" and "PeakMatrix".
k: The number of cells neighboring a simulated doublet to be considered as putative doublets.
nTrials: The number of times to simulate nCell doublets (where nCell is the number of cells in the sample) when calculating doublet scores.
dimsToUse: A vector containing the dimensions from the reducedDims object to use in clustering.
LSIMethod: A number or string indicating the order of operations in the TF-IDF normalization. Possible values are: 1 or "tf-logidf", 2 or "log(tf-idf)", and 3 or "logtf-logidf".
scaleDims: A boolean that indicates whether to z-score the reduced dimensions for each cell during the LSI method performed for doublet determination. This is useful for minimizing the contribution of strong biases (which dominate the early PCs) and of low-abundance populations. However, it may lead to stronger sample-specific biases because it over-weights latent PCs.
corCutOff: A numeric cutoff for the correlation of each dimension to the sequencing depth. If the dimension has a correlation to sequencing depth that is greater than the corCutOff, it will be excluded from analysis.
knnMethod: The name of the dimensionality reduction method to be used for k-nearest neighbors calculation. Possible values are "UMAP" or "LSI".
UMAPParams: The list of parameters to pass to the UMAP function if knnMethod is set to "UMAP". See the function umap in the uwot package.
LSIParams: The list of parameters to pass to the IterativeLSI() function. See IterativeLSI().
outDir: The relative path to the output directory for relevant plots/results from doublet identification.
threads: The number of threads to be used for parallel computing.
force: A boolean that controls what happens when the UMAP projection is not accurate (R < 0.8 for the reprojection of the training data, which occurs when you have a very homogeneous population of cells). With force=FALSE, the function returns -1 for all doubletScores and doubletEnrichments. Setting force=TRUE bypasses this check (not recommended!).
parallelParam: A list of parameters to be passed for BiocParallel/batchtools parallel computing.
verbose: A boolean value that determines whether standard output is printed.
logFile: The path to a file to be used for logging ArchR output.
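
The three `LSIMethod` options differ only in where the log transform is applied. The base-R sketch below shows one plausible reading of each ordering on a toy matrix. It is not ArchR's code: the matrix values, the `scaleTo` constant, and the exact form of the IDF term are assumptions chosen for illustration.

```r
# Toy count matrix: 4 features (rows) x 3 cells (columns); values are made up.
mat <- matrix(c(1, 0, 2,
                0, 1, 1,
                3, 1, 0,
                1, 1, 1), nrow = 4, byrow = TRUE)

# Term frequency: normalize each cell (column) by its total counts.
tf <- sweep(mat, 2, colSums(mat), "/")

nCells <- ncol(mat)
featureSums <- rowSums(mat)
scaleTo <- 1e4  # assumed scaling constant, used only for method 2

# 1 or "tf-logidf": log-transform the IDF only.
m1 <- tf * log(1 + nCells / featureSums)

# 2 or "log(tf-idf)": multiply TF by IDF first, then log-transform the product.
m2 <- log(tf * (nCells / featureSums) * scaleTo + 1)

# 3 or "logtf-logidf": log-transform both TF and IDF before multiplying.
m3 <- log(tf + 1) * log(1 + nCells / featureSums)
```

In all three variants a zero count stays zero, so the sparsity pattern of the input is preserved; only the relative weighting of features and cells changes.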

Examples


# Get Test ArchR Project
proj <- getTestProject()

# Add Doublet Scores for Small Project
proj <- addDoubletScores(
  proj,
  dimsToUse = 1:5,
  LSIParams = list(dimsToUse = 1:5, varFeatures = 1000, iterations = 2)
)
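
A possible follow-up, sketched under the assumption that ArchR is installed and the test project can be loaded: the per-cell results should be available as the `DoubletScore` and `DoubletEnrichment` columns of the cell metadata, and `filterDoublets()` can then be used to remove likely doublets. The default arguments of `filterDoublets()` are used here; suitable cutoffs depend on the data set.

```r
library(ArchR)

# Rebuild the small example so this block can run on its own.
proj <- getTestProject()
proj <- addDoubletScores(
  proj,
  dimsToUse = 1:5,
  LSIParams = list(dimsToUse = 1:5, varFeatures = 1000, iterations = 2)
)

# Inspect the per-cell doublet information stored in the cell metadata.
doubletInfo <- getCellColData(proj, select = c("DoubletScore", "DoubletEnrichment"))
head(doubletInfo)

# Remove likely doublets using the default settings of filterDoublets().
projFiltered <- filterDoublets(proj)
```

Note that, as described for `force`, both columns are set to -1 for a sample whose UMAP projection is judged inaccurate, so it is worth checking the values before filtering.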