1.6 Creating Arrow Files

For the remainder of this tutorial, we will use data from a downsampled dataset of hematopoietic cells from Granja* et al. Nature Biotechnology 2019. This includes data from bone marrow mononuclear cells (BMMC), peripheral blood mononuclear cells (PBMC), and CD34+ hematopoietic stem and progenitor cells from bone marrow (CD34 BMMC).

These data are downloaded as fragment files, which contain the start and end genomic coordinates of all aligned sequenced fragments. Fragment files are one of the base file types of the 10x Genomics analytical platform (and other platforms) and can be easily created from any BAM file. See the 10x Genomics website for information on making your own fragment files for input to ArchR.
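To make the fragment file layout concrete, here is a minimal base R sketch that writes and re-reads a tiny fragment file. The records, barcodes, and column names are invented for illustration. The five-column layout (chromosome, start, end, cell barcode, duplicate count) follows the 10x-style convention, and computing width as end minus start assumes BED-like intervals.

```r
# Toy fragment records (hypothetical values) in a 10x-style layout:
# chromosome, start, end, cell barcode, duplicate read count.
toyFragments <- data.frame(
  chr     = c("chr1", "chr1", "chr2"),
  start   = c(10050L, 10480L, 20010L),
  end     = c(10230L, 10710L, 20190L),
  barcode = c("AAACGAAAGTCGTATC-1", "AAACGAAAGTCGTATC-1", "AAACTCGCATTGCGAT-1"),
  count   = c(1L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Write a gzipped, tab-separated file without a header row.
fragPath <- tempfile(fileext = ".fragments.tsv.gz")
con <- gzfile(fragPath, open = "wt")
write.table(toyFragments, con, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
close(con)

# Read the file back through a gzfile() connection.
fragments <- read.table(
  gzfile(fragPath), sep = "\t", header = FALSE, stringsAsFactors = FALSE,
  col.names = c("chr", "start", "end", "barcode", "count")
)

# Fragment width, assuming BED-like coordinates (an assumption of this sketch).
fragments$width <- fragments$end - fragments$start
print(fragments)
```

Peeking at the first few rows of a real fragment file in the same way is a quick check that the file is tab-separated and has the expected columns before it is handed to ArchR.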

Once we have our fragment files, we provide their paths as a character vector to createArrowFiles(). During creation, some basic metadata and matrices are added to each Arrow file, including a “TileMatrix” containing insertion counts across genome-wide 500-bp bins (see addTileMatrix()) and a “GeneScoreMatrix” that stores predicted gene expression based on weighting insertion counts in tiles near a gene promoter (see addGeneScoreMatrix()).
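The idea behind a tile matrix can be sketched in a few lines of base R. This is an illustration of binning insertions into 500-bp tiles, not ArchR's implementation: the fragments and barcodes are hypothetical, a single chromosome is assumed, and ArchR's own storage and any further processing may differ.

```r
tileSize <- 500L

# Hypothetical fragments for two cells on one chromosome.
frags <- data.frame(
  start   = c(120L, 480L, 1010L, 1490L),
  end     = c(300L, 620L, 1200L, 1700L),
  barcode = c("cellA", "cellA", "cellB", "cellB"),
  stringsAsFactors = FALSE
)

# Each fragment contributes two insertions, one at each end.
insertions <- data.frame(
  pos     = c(frags$start, frags$end),
  barcode = rep(frags$barcode, times = 2),
  stringsAsFactors = FALSE
)

# Assign every insertion to a 500-bp tile (tile 1 covers positions 1-500).
insertions$tile <- (insertions$pos - 1L) %/% tileSize + 1L

# Rows are tiles, columns are cells, entries are insertion counts.
tileCounts <- table(tile = insertions$tile, barcode = insertions$barcode)
print(tileCounts)
```

With real data most tile-by-cell entries are zero, which is why a sparse representation is the natural choice for a matrix like this.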

The tutorial data can be downloaded using the getTutorialData() function. The tutorial data is approximately 0.5 GB in size. If you have already downloaded the tutorial data to the current working directory, ArchR will bypass downloading.

library(ArchR)

inputFiles <- getTutorialData("Hematopoiesis")
inputFiles

## scATAC_BMMC_R1
## "HemeFragments/scATAC_BMMC_R1.fragments.tsv.gz"
## scATAC_CD34_BMMC_R1
## "HemeFragments/scATAC_CD34_BMMC_R1.fragments.tsv.gz"
## scATAC_PBMC_R1
## "HemeFragments/scATAC_PBMC_R1.fragments.tsv.gz"
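When working with your own data instead of the tutorial data, the same kind of named character vector can be assembled by hand. The sketch below uses only base R. The directory name, the file names, and the `.fragments.tsv.gz` suffix are assumptions made for the demonstration, and the files are created empty so the code runs anywhere.

```r
# Hypothetical directory holding fragment files (created empty for the demo).
fragDir <- file.path(tempdir(), "MyFragments")
dir.create(fragDir, showWarnings = FALSE)
file.create(file.path(fragDir, c("sampleA.fragments.tsv.gz",
                                 "sampleB.fragments.tsv.gz")))

# Collect the paths and name each element after its sample.
myInputFiles <- list.files(fragDir, pattern = "\\.fragments\\.tsv\\.gz$",
                           full.names = TRUE)
names(myInputFiles) <- sub("\\.fragments\\.tsv\\.gz$", "",
                           basename(myInputFiles))

# Fail early if any path is wrong.
stopifnot(all(file.exists(myInputFiles)))
print(names(myInputFiles))
```

Naming the vector this way lets `names(myInputFiles)` be passed directly as the sample names, mirroring how `names(inputFiles)` is used in this tutorial.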

As always, before starting a project we must set the ArchRGenome and default threads for parallelization.

addArchRGenome("hg19")

## Setting default genome to Hg19.

addArchRThreads(threads = 16) 

## Setting default number of Parallel threads to 16.
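The value of 16 threads suits the machine used for this tutorial and may not suit yours. One possible way to choose a value from the available hardware is sketched below. The rule of leaving two cores free and capping at 16 is an arbitrary assumption of this example, not an ArchR recommendation.

```r
# detectCores() can return NA on some platforms, so fall back to 1.
nCores <- parallel::detectCores()
if (is.na(nCores)) nCores <- 1L

# Leave two cores free, cap at the tutorial's value of 16, never go below 1.
nThreads <- max(1L, min(16L, nCores - 2L))
print(nThreads)

# Only call into ArchR if it is installed.
if (requireNamespace("ArchR", quietly = TRUE)) {
  ArchR::addArchRThreads(threads = nThreads)
}
```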

Now we will create our Arrow files, which will take 10-15 minutes. For each sample, this step will:

  1. Read accessible fragments from the provided input files.
  2. Calculate quality control information for each cell (i.e. TSS enrichment scores and nucleosome info).
  3. Filter cells based on quality control parameters.
  4. Create a genome-wide TileMatrix using 500-bp bins.
  5. Create a GeneScoreMatrix using the custom geneAnnotation that was defined when we called addArchRGenome().

ArrowFiles <- createArrowFiles(
  inputFiles = inputFiles,
  sampleNames = names(inputFiles),
  filterTSS = 4, # Don't set this too high because you can always increase it later
  filterFrags = 1000,
  addTileMat = TRUE,
  addGeneScoreMat = TRUE
)

## Using GeneAnnotation set by addArchRGenome(Hg19)!
## Using GeneAnnotation set by addArchRGenome(Hg19)!
## ArchR logging to : ArchRLogs/ArchR-createArrows-dfa159ddbf6e-Date-2020-04-15_Time-09-21-27.log
## If there is an issue, please report to github with logFile!
## Cleaning Temporary Files
## 2020-04-15 09:21:28 : Batch Execution w/ safelapply!, 0 mins elapsed.
## ArchR logging successful to : ArchRLogs/ArchR-createArrows-dfa159ddbf6e-Date-2020-04-15_Time-09-21-27.log
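The effect of the two quality control cutoffs used above can be illustrated on a toy table of per-cell metrics. The cells and their values are invented, and treating both cutoffs as inclusive is an assumption of this sketch rather than a statement about ArchR's internals.

```r
# Hypothetical per-cell QC values.
qc <- data.frame(
  barcode       = c("cell1", "cell2", "cell3", "cell4"),
  TSSEnrichment = c(8.2, 3.1, 12.5, 5.0),
  nFrags        = c(5400, 7200, 800, 1000),
  stringsAsFactors = FALSE
)

# Same cutoffs as in the createArrowFiles() call above.
filterTSS   <- 4
filterFrags <- 1000

# Keep cells at or above both cutoffs (inclusive comparison is assumed here).
keep    <- qc$TSSEnrichment >= filterTSS & qc$nFrags >= filterFrags
passing <- qc[keep, ]
print(passing)
```

Here cell2 fails on TSS enrichment and cell3 fails on fragment count. Cells removed at this stage are not written to the Arrow file, which is why lenient cutoffs are the safer starting point: stricter filtering can still be applied later.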

We can inspect the ArrowFiles object to see that it is actually just a character vector of Arrow file paths.

ArrowFiles

## "scATAC_BMMC_R1.arrow" "scATAC_CD34_BMMC_R1.arrow"
## "scATAC_PBMC_R1.arrow"
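Because the result is a plain character vector of paths, ordinary file utilities work on it. As a small sanity check before moving on, one could confirm that every listed file exists on disk. The sketch below uses a stand-in vector with hypothetical file names, and deliberately creates only one of the two files so that the check has something to report.

```r
# Stand-in for the vector returned by createArrowFiles() (hypothetical paths).
arrowPaths <- file.path(tempdir(), c("sampleA.arrow", "sampleB.arrow"))
file.create(arrowPaths[1])  # pretend only the first file was written

# Report any paths that do not exist on disk.
stopifnot(is.character(arrowPaths))
missingFiles <- arrowPaths[!file.exists(arrowPaths)]
if (length(missingFiles) > 0) {
  message("Missing Arrow files: ",
          paste(basename(missingFiles), collapse = ", "))
}
```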