1.5 Getting Set Up

The first thing we do is change to our desired our working directory, set the number of threads we would like to use, and load our gene and genome annotations. Depending on the configuration of your local environment, you may need to modify the number of threads used below in addArchRThreads(). By default ArchR uses half of the total number of threads available but you can adjust this manually as you see fit. If you are using Windows, the usable threads will automatically be set to 1 because the parallel processing in ArchR is built for Unix-based operating systems.

First, we load the ArchR library. If this fails, you have not properly installed ArchR and should revisit the installation instructions.

library(ArchR)

Next, we set the default number of threads for ArchR functions. This is something you will have to do during each new R session. We recommend setting threads to 1/2 to 3/4 of the total available cores. The memory usage in ArchR will often scale with the number of threads used so allowing ArchR to use more threads will also lead to higher memory usage.

addArchRThreads(threads = 16) 

## Setting default number of Parallel threads to 16.

Then, we set the genome to be used for gene and genome annotations. As above, this is something you will have to do during each new R session. Of course, this genome version must match the genome version that was used for alignment. For the data used in this tutorial, we will use the hg19 reference genome but ArchR natively supports additional genome annotations and custome genome annotations as outlined in the next section.

addArchRGenome("hg19")

## Setting default genome to Hg19. ## Setting a Genome and GeneAnnotation

ArchR requires gene and genome annotations to do things such as calculate TSS enrichment scores, nucleotide content, and gene activity scores. Because our tutorial dataset uses scATAC-seq data that has already been aligned to the hg19 reference genome, we have set “hg19” as the default genome in the previous section. However, ArchR supports “hg19”, “hg38”, “mm9”, and “mm10” natively and you can create your own genome and gene annotations using the createGeneAnnotation() and createGenomeAnnotation() functions.

Providing this information to ArchR is streamlined through the addArchRGenome() function. This function tells ArchR that, for all analyses in the current session, it should use the genomeAnnotation and geneAnnotation associated with the defined ArchRGenome. Each of the natively supported genomes are composed of a BSgenome object that defines the genomic coordinates and sequence of each chromosome, a GRanges object containing a set of blacklisted regions, a TxDb object that defines the positions and structures of all genes, and an OrgDb object that provides a central gene identifier and contains mappings between this identifier and other kinds of identifiers.

Below are examples of how to load gene and genome annotations for the natively supported genomes as well as information on their BSgenome and blacklist components.


The precompiled version of the hg19 genome in ArchR uses BSgenome.Hsapiens.UCSC.hg19, TxDb.Hsapiens.UCSC.hg19.knownGene, org.Hs.eg.db, and a blacklist that was merged using ArchR::mergeGR() from the hg19 v2 blacklist regions and from mitochondrial regions that show high mappability to the hg19 nuclear genome from Caleb Lareau and Jason Buenrostro. To set a global genome default to the precompiled hg19 genome:

addArchRGenome("hg19")

## Setting default genome to Hg19.


The precompiled version of the hg38 genome in ArchR uses BSgenome.Hsapiens.UCSC.hg38, TxDb.Hsapiens.UCSC.hg38.knownGene, org.Hs.eg.db, and a blacklist that was merged using ArchR::mergeGR() from the hg38 v2 blacklist regions and from mitochondrial regions that show high mappability to the hg38 nuclear genome from Caleb Lareau and Jason Buenrostro. To set a global genome default to the precompiled hg38 genome:

addArchRGenome("hg38")

## Setting default genome to Hg38.


The precompiled version of the mm9 genome in ArchR uses BSgenome.Mmusculus.UCSC.mm9, TxDb.Mmusculus.UCSC.mm9.knownGene, org.Mm.eg.db, and a blacklist that was merged using ArchR::mergeGR() from the mm9 v1 blacklist regions from Anshul Kundaje and from mitochondrial regions that show high mappability to the mm9 nuclear genome from Caleb Lareau and Jason Buenrostro. To set a global genome default to the precompiled mm9 genome:

addArchRGenome("mm9")

## Setting default genome to Mm9.


The precompiled version of the mm10 genome in ArchR uses BSgenome.Mmusculus.UCSC.mm10, TxDb.Mmusculus.UCSC.mm10.knownGene, org.Mm.eg.db, and a blacklist that was merged using ArchR::mergeGR() from the mm10 v2 blacklist regions and from mitochondrial regions that show high mappability to the mm10 nuclear genome from Caleb Lareau and Jason Buenrostro. To set a global genome default to the precompiled mm10 genome:

addArchRGenome("mm10")

## Setting default genome to Mm10.


1.5.1 Creating a Custom ArchRGenome

As described above, an ArchRGenome consists of a genome annotation and a gene annotation.

To create a custom genome annotation, we use createGenomeAnnotation(). To do this, you will need the following information:

  1. A BSgenome object which contains the sequence information for a genome. These are commonly Bioconductor packages (for example, BSgenome.Hsapiens.UCSC.hg38) that can be easily found with google.
  2. A GRanges genomic ranges object containing a set of blacklisted regions that will be used to filter out unwanted regions from downstream analysis. This is not required but is recommended. For information on how blacklists are created, see the publication on the ENCODE blacklists.

For example, if we wanted to create a custom genome annotation from Drosophila melanogaster, we would first identify and install and load the relevant BSgenome object.

if (!requireNamespace("BSgenome.Dmelanogaster.UCSC.dm6", quietly = TRUE)){
  BiocManager::install("BSgenome.Dmelanogaster.UCSC.dm6")
}
library(BSgenome.Dmelanogaster.UCSC.dm6)

## Loading required package: BSgenome
## Loading required package: Biostrings
## Loading required package: XVector
## Attaching package: ‘Biostrings’
## The following object is masked from ‘package:base’: strsplit

Then we create a genome annotation from this BSgenome object.

genomeAnnotation <- createGenomeAnnotation(genome = BSgenome.Dmelanogaster.UCSC.dm6)

## Getting genome..
## Getting chromSizes..
## Getting blacklist..
## Blacklist not downloaded! Continuing without, be careful for downstream biases..

Examining this object shows the constituent parts of an ArchR genome annotation.

genomeAnnotation

## List of length 3
## names(3): genome chromSizes blacklist

To create a custom gene annotation, we use createGeneAnnotation(). To do this, you will need the following information:

  1. A TxDb object (transcript database) from Bioconductor which contains information for gene/transcript coordinates.
  2. An OrgDb object (organism database) from Bioconductor which provides a unified framework to map between gene names and various gene identifiers.

Continuing with out example from Drosophila melanogaster, we first install and load the relevant TxDb and OrgDb objects.

if (!requireNamespace("TxDb.Dmelanogaster.UCSC.dm6.ensGene", quietly = TRUE)){
  BiocManager::install("TxDb.Dmelanogaster.UCSC.dm6.ensGene")
}
if (!requireNamespace("org.Dm.eg.db", quietly = TRUE)){
  BiocManager::install("org.Dm.eg.db")
}
library(TxDb.Dmelanogaster.UCSC.dm6.ensGene)

## Loading required package: GenomicFeatures
## Loading required package: AnnotationDbi

library(org.Dm.eg.db)

Then we create the gene annotation object.

geneAnnotation <- createGeneAnnotation(TxDb = TxDb.Dmelanogaster.UCSC.dm6.ensGene, OrgDb = org.Dm.eg.db)

## Getting Genes..
## Determined Annotation Style = ENSEMBL
## Getting Exons..
## Getting TSS..

Examining this object shows the constituent parts of an ArchR gene annotation.

geneAnnotation

## List of length 3
## names(3): genes exons TSS

Alternatively, if you dont have a TxDb and OrgDb object, you can create a geneAnnotation object from the following information :

  1. A “genes” object - a GRanges object containing gene coordinates (start to end). This object must have a symbols column matching the symbols column of the “exons” object described below.
  2. An “exons” object - GRanges object containing gene exon coordinates. Must have a symbols column matching the symbols column of the “genes” object described above.
  3. A GRanges object containing standed transcription start site (TSS) coordinates.
geneAnnotation <- createGeneAnnotation(
  TSS = geneAnnotation$TSS, 
  exons = geneAnnotation$exons, 
  genes = geneAnnotation$genes
)

1.5.2 Using Non-standard Genomes in ArchR

ArchR implements some checks to prevent you from doing things that we consider “out of the norm”. One of these checks forces the seqnames of your genome annotation to start with “chr”. This is true in most applications but there are some genomes (for example Maize) which do not use “chr” as a prefix for the different chromosomes. To perform any ArchR analyses on such a genome, you have to tell ArchR to ignore the chromosome prefixes. To do this, you must run addArchRChrPrefix(chrPrefix = FALSE) prior to creating your Arrow files. This will turn off the requirement for “chr” prefixes on seqnames globally in the current R session. Please keep in mind ArchR converts chromosome/seqnames to character by default. Thus, if your seqnames are just integers, you will need to supply them as characters i.e. useSeqnames = c("1", "2", "3") instead of useSeqnames = c(1, 2, 3). You can always check if chromosome prefixes are required in the current R session using getArchRChrPrefix().