10.2 Calling Peaks w/ Macs2

As mentioned above, we generate a reproducible peak set in ArchR using the addReproduciblePeakSet() function. By default ArchR attempts to call peaks using MACS2; however, ArchR also implements its own native peak caller which could be used when MACS2 cannot be installed (for example, we have not successfully installed MACS2 on Windows) - this alternative peak calling method is described in the next section.

To call peaks using MACS2, ArchR must be able to find the MACS2 executable. First, ArchR looks in your PATH environment variable. If this is unsuccessful, ArchR attempts to determine if you have installed MACS2 with either pip or pip3. If neither of these is successful, ArchR gives up and provides an error message. If you have MACS2 installed and ArchR cannot find it, you should provide the path to the addReproduciblePeakSet() function via the pathToMacs2 parameter.

pathToMacs2 <- findMacs2()

## Searching For MACS2..
## Found with $path!

With the path to MACS2 identified, we can then create a reproducible merged peak set w/ MACS2 (~5-10 minutes). To avoid bias from pseudo-bulk replicates that have very few cells, we can provide a cutoff for the upper limit of the number of peaks called per cell via the peaksPerCell parameter. This prevents clusters with very few cells from contributing lots of low quality peaks to the merged peak set. There are many other parameters that can be tweaked in addReproduciblePeakSet() - try ?addReproduciblePeakSet for more information.

Each ArchRProject object can only contain a single peak set. As such, we assign the output of addReproduciblePeakSet() to our desired ArchRProject. If you would like to experiment with different peak sets, you must save a copy of your ArchRProject and thus also copy the Arrow files. While this does use more on-disk storage space, this is unavoidable given the structure of Arrow files and the storage of peak matrix information in the Arrow files.

projHeme4 <- addReproduciblePeakSet(
    ArchRProj = projHeme4, 
    groupBy = "Clusters2", 
    pathToMacs2 = pathToMacs2
)

## ArchR logging to : ArchRLogs/ArchR-addReproduciblePeakSet-faba719c230c-Date-2020-04-15_Time-10-25-16.log
## If there is an issue, please report to github with logFile!
## Calling Peaks with Macs2
## 2020-04-15 10:25:17 : Peak Calling Parameters!, 0.014 mins elapsed.
## Group nCells nCellsUsed nReplicates nMin nMax maxPeaks
## B B 439 433 2 182 251 150000
## CD4.M CD4.M 678 617 2 117 500 150000
## CD4.N CD4.N 1271 552 2 52 500 150000
## CLP CLP 387 387 2 82 305 150000
## Erythroid Erythroid 879 694 2 194 500 150000
## GMP GMP 793 751 2 251 500 150000
## Mono Mono 2632 1000 2 500 500 150000
## NK NK 851 801 2 301 500 150000
## pDC pDC 320 311 2 150 161 150000
## PreB PreB 351 351 2 40 311 150000
## Progenitor Progenitor 1650 672 2 172 500 150000
## 2020-04-15 10:25:17 : Batching Peak Calls!, 0.014 mins elapsed.
## 2020-04-15 10:25:17 : Batch Execution w/ safelapply!, 0 mins elapsed.
## 2020-04-15 10:27:01 : Identifying Reproducible Peaks!, 1.74 mins elapsed.
## 2020-04-15 10:27:09 : Creating Union Peak Set!, 1.886 mins elapsed.
## Converged after 7 iterations!
## [1] “plotting ggplot!”
## 2020-04-15 10:27:15 : Finished Creating Union Peak Set (144009)!, 1.98 mins elapsed.

To retrieve this peak set as a GRanges object, we use the getPeakSet() function. This peak set contains an annotation for the group from which each peak originated. However, these annotations do not inherently mean that the given peak was only called in that group, rather that the annotated group had the highest normalized significance for that peak call.

getPeakSet(projHeme4)

## GRanges object with 144009 ranges and 12 metadata columns:
## seqnames ranges strand | score
## |
## Mono chr1 752499-752999 * | 24.54003
## NK chr1 762651-763151 * | 141.22064
## B chr1 801006-801506 * | 14.18461
## B chr1 805039-805539 * | 37.30365
## CLP chr1 845325-845825 * | 2.81281
## … … … … . …
## Erythroid chrX 154664540-154665040 * | 7.09786
## NK chrX 154807324-154807824 * | 9.38477
## PreB chrX 154840785-154841285 * | 3.29501
## PreB chrX 154842404-154842904 * | 7.68692
## NK chrX 154862017-154862517 * | 12.44
## replicateScoreQuantile groupScoreQuantile Reproducibility
##
## Mono 0.812 0.655 2
## NK 0.887 0.797 2
## B 0.7 0.418 2
## B 0.959 0.9 2
## CLP 0.706 0.311 2
## … … … …
## Erythroid 0.69 0.279 2
## NK 0.424 0.166 2
## PreB 0.669 0.274 2
## PreB 0.932 0.772 2
## NK 0.492 0.202 2
## GroupReplicate distToGeneStart nearestGene peakType
##
## Mono Mono._.scATAC_PBMC_R1 10152 LINC00115 Distal
## NK NK._.scATAC_PBMC_R1 0 LINC00115 Promoter
## B B._.scATAC_PBMC_R1 10925 FAM41C Distal
## B B._.scATAC_BMMC_R1 6892 FAM41C Intronic
## CLP CLP._.scATAC_BMMC_R1 9241 LINC02593 Distal
## … … … … …
## Erythroid Erythroid._.scATAC_BMMC_R1 100803 CLIC2 Distal
## NK NK._.scATAC_PBMC_R1 35047 TMLHE Intronic
## PreB PreB._.Rep2 1586 TMLHE Intronic
## PreB PreB._.Rep2 31 TMLHE Promoter
## NK NK._.scATAC_PBMC_R1 19644 TMLHE Distal
## distToTSS nearestTSS GC idx
##
## Mono 10152 uc001aau.3 0.483 1
## NK 0 uc001aau.3 0.6906 2
## B 10925 uc021oei.1 0.4371 3
## B 6892 uc021oei.1 0.7285 4
## CLP 1239 uc010nxu.2 0.7904 5
## … … … … …
## Erythroid 22387 uc004cim.2 0.515 3462
## NK 35047 uc004cin.3 0.525 3463
## PreB 1586 uc004cin.3 0.485 3464
## PreB 31 uc004cin.3 0.5888 3465
## NK 19644 uc004cin.3 0.4212 3466
## ——-
## seqinfo: 23 sequences from an unspecified genome; no seqlengths