10.3 Calling Peaks w/ TileMatrix

As mentioned previously, ArchR also implements its own native peak caller. While we have benchmarked this peak caller against MACS2 and note very similar performances, we do not recommend using this native peak caller unless absolutely necessary.

The ArchR native peak caller calls peaks on the 500-bp TileMatrix and we indicate to addReproduciblePeakSet() that we want to use this peak caller via the peakMethod parameter. Note that we are not storing the output into the projHeme4 object because we do not intend to keep this peak set and this analysis is only for illustrative purposes. Storage into the ArchRProject object would overwrite the previous reproducible peak set already stored in projHeme4.

projHemeTmp <- addReproduciblePeakSet(
    ArchRProj = projHeme4, 
    groupBy = "Clusters2",
    peakMethod = "Tiles",
    method = "p"
)

## ArchR logging to : ArchRLogs/ArchR-addReproduciblePeakSet-faba49485623-Date-2020-04-15_Time-10-27-15.log
## If there is an issue, please report to github with logFile!
## Calling Peaks with TileMatrix
## Group Coverages Already Computed Returning Groups, Set force = TRUE to Recompute!
## List of length 11
## names(11): B CD4.M CD4.N CLP Erythroid GMP Mono NK pDC PreB Progenitor
## 2020-04-15 10:27:16 : Peak Calling Parameters!, 0.012 mins elapsed.
## Group nCells nCellsUsed nReplicates nMin nMax maxPeaks
## B B 439 433 2 182 251 150000
## CD4.M CD4.M 678 617 2 117 500 150000
## CD4.N CD4.N 1271 552 2 52 500 150000
## CLP CLP 387 387 2 82 305 150000
## Erythroid Erythroid 879 694 2 194 500 150000
## GMP GMP 793 751 2 251 500 150000
## Mono Mono 2632 1000 2 500 500 150000
## NK NK 851 801 2 301 500 150000
## pDC pDC 320 311 2 150 161 150000
## PreB PreB 351 351 2 40 311 150000
## Progenitor Progenitor 1650 672 2 172 500 150000
## 2020-04-15 10:27:24 : Computing Total Accessibility Across All Features, 0.138 mins elapsed.
## 2020-04-15 10:27:27 : Computing Pseudo-Grouped Tile Matrix, 0.19 mins elapsed.
## 2020-04-15 10:28:07 : Created Pseudo-Grouped Tile Matrix (0.474 GB), 0.866 mins elapsed.
## Expectation = 0.137769200114613Expectation = 0.170233276575844Expectation = 0.502430087836881Expectation = 0.0743484690298421Expectation = 0.231170400914268Expectation = 0.0293481232153502Expectation = 0.188425424281447Expectation = 0.0412872862125409Expectation = 0.369108556109225Expectation = 0.120760726012825Expectation = 0.33554874172927Expectation = 0.330837760307742Expectation = 0.41068352704434Expectation = 0.343789007051322Expectation = 0.571660173038985Expectation = 0.180700422552374Expectation = 0.126306931769154Expectation = 0.107652051338631Expectation = 0.200607316117261Expectation = 0.0248573103536859Expectation = 0.38762741617292Expectation = 0.0990605372969163
## 2020-04-15 10:28:51 : Creating Group Peak Sets with Annotations!, 1.59 mins elapsed.
## 2020-04-15 10:29:03 : Creating Union Peak Set with Annotations!, 1.789 mins elapsed.
## Annotating Peaks : Nearest Gene
## Annotating Peaks : Gene
## Annotating Peaks : TSS
## Annotating Peaks : GC
## [1] “plotting ggplot!”
## 2020-04-15 10:29:22 : Finished Creating Union Peak Set (271917)!, 2.104 mins elapsed.

We can similarly examine this merged peak set.

getPeakSet(projHemeTmp)

## GRanges object with 271917 ranges and 9 metadata columns:
## seqnames ranges strand | mlog10p Group
## |
## [1] chr1 752000-752499 * | 0.742 Mono
## [2] chr1 752500-752999 * | 6.561 Mono
## [3] chr1 758500-758999 * | 1.11 NK
## [4] chr1 762000-762499 * | 2.461 NK
## [5] chr1 762500-762999 * | 22.536 NK
## … … … … . … …
## [271913] chrX 154862000-154862499 * | 1.966 NK
## [271914] chrX 154862500-154862999 * | 1.334 NK
## [271915] chrX 154912500-154912999 * | 2.01 Erythroid
## [271916] chrX 154997000-154997499 * | 1.349 Progenitor
## [271917] chrX 154998000-154998499 * | 1.227 Mono
## distToGeneStart nearestGene peakType distToTSS nearestTSS
##
## [1] 10652 LINC00115 Distal 10652 uc001aau.3
## [2] 10152 LINC00115 Distal 10152 uc001aau.3
## [3] 4152 LINC00115 Distal 4152 uc001aau.3
## [4] 652 LINC00115 Promoter 652 uc001aau.3
## [5] 152 LINC00115 Promoter 152 uc001aau.3
## … … … … … …
## [271913] 19626 TMLHE Distal 19626 uc004cin.3
## [271914] 20126 TMLHE Distal 20126 uc004cin.3
## [271915] 70126 TMLHE Distal 70126 uc004cin.3
## [271916] 154626 TMLHE Distal 201 uc004cin.3
## [271917] 155626 TMLHE Distal 797 uc004cin.3
## GC idx
##
## [1] 0.376 1
## [2] 0.484 2
## [3] 0.56 3
## [4] 0.574 4
## [5] 0.684 5
## … … …
## [271913] 0.43 6212
## [271914] 0.42 6213
## [271915] 0.368 6214
## [271916] 0.542 6215
## [271917] 0.476 6216
## ——-
## seqinfo: 23 sequences from an unspecified genome; no seqlengths

10.3.1 Comparing the two peak calling methods

To compare the merged peak set generated using MACS2 vs the merged peak set generated using the ArchR native TileMatrix peak caller, we can check the perfecent of overlapping peaks etc.

First, we check the percent of MACS2-called peaks that are overlapped by the TileMatrix-called peaks

length(subsetByOverlaps(getPeakSet(projHeme4), getPeakSet(projHemeTmp))) / length(getPeakSet(projHeme4))

## [1] 0.9627246

Then, we check the converse - the percent of TileMatrix-called peaks that are overlapped by MACS2-called peaks. You can see that this overlap is not as strong.

length(subsetByOverlaps(getPeakSet(projHemeTmp), getPeakSet(projHeme4))) / length(getPeakSet(projHemeTmp))

## [1] 0.7533365

If we increase the margins of the peaks to be wider (1000-bp peaks instead of 500-bp peaks), the percent of MACS2-called peaks that are overlapped does not change much.

length(subsetByOverlaps(resize(getPeakSet(projHeme4), 1000, "center"), getPeakSet(projHemeTmp))) / length(getPeakSet(projHeme4))

## [1] 0.9676687

But the percent of TileMatrix-called peaks overlapped by MACS2 does increase.

length(subsetByOverlaps(getPeakSet(projHemeTmp), resize(getPeakSet(projHeme4), 1000, "center"))) / length(getPeakSet(projHemeTmp))

## [1] 0.8287639