15.3 Peak2GeneLinkage with ArchR

Similar to co-accessibility, ArchR can also identify so-called “peak-to-gene links”. The primary differences between peak-to-gene links and co-accessibility is that co-accessibility is an ATAC-seq-only analysis that looks for correlations in accessibility between two peaks while peak-to-gene linkage leverages integrated scRNA-seq data to look for correlations between peak accessibility and gene expression. These represent orthogonal approaches to a similar problem. However, because peak-to-gene linkage correlates scATAC-seq and scRNA-seq data, we often think of these links as more relevant to gene regulatory interactions.

To identify peak-to-gene links in ArchR, we use the addPeak2GeneLinks() function.

projHeme5 <- addPeak2GeneLinks(
    ArchRProj = projHeme5,
    reducedDims = "IterativeLSI"
)

## ArchR logging to : ArchRLogs/ArchR-addPeak2GeneLinks-11a7017ae41d-Date-2020-04-15_Time-11-57-10.log
## If there is an issue, please report to github with logFile!
## 2020-04-15 11:57:11 : Getting Available Matrices, 0.006 mins elapsed.
## 2020-04-15 11:57:11 : Filtered Low Prediction Score Cells (1049 of 10251, 0.102), 0.004 mins elapsed.
## 2020-04-15 11:57:12 : Computing KNN, 0.009 mins elapsed.
## 2020-04-15 11:57:12 : Identifying Non-Overlapping KNN pairs, 0.016 mins elapsed.
## 2020-04-15 11:57:15 : Identified 491 Groupings!, 0.07 mins elapsed.
## 2020-04-15 11:57:15 : Getting Group RNA Matrix, 0.071 mins elapsed.
## 2020-04-15 11:57:43 : Getting Group ATAC Matrix, 0.536 mins elapsed.
## 2020-04-15 11:58:18 : Normalizing Group Matrices, 1.112 mins elapsed.
## 2020-04-15 11:58:24 : Finding Peak Gene Pairings, 1.218 mins elapsed.
## 2020-04-15 11:58:25 : Computing Correlations, 1.227 mins elapsed.
## 2020-04-15 11:58:34 : Completed Peak2Gene Correlations!, 1.382 mins elapsed.
## ArchR logging successful to : ArchRLogs/ArchR-addPeak2GeneLinks-11a7017ae41d-Date-2020-04-15_Time-11-57-10.log

We can then retrieve these peak-to-gene links in a similar fashion to how we retrieved co-accessibility interactions by using the getPeak2GeneLinks() function. As we saw previously, this function allows for a user-specified cutoff for correlation and resolution for linkages.

p2g <- getPeak2GeneLinks(
    ArchRProj = projHeme5,
    corCutOff = 0.45,
    resolution = 1,
    returnLoops = FALSE
)

When returnLoops is set to false, this function returns a DataFrame object anaolgous to the DataFrame object returned by getCoAccessibility(). The primary difference is that the indexes for the scATAC-seq peaks are stored in the idxATAC column and the indexes for the scRNA-seq gene are stored in the idxRNA column.

p2g

## DataFrame with 43754 rows and 6 columns
## idxATAC idxRNA Correlation FDR
##
## 1 47 5 0.549552663393716 1.34094093110629e-38
## 2 3 6 0.487418258348982 2.1460798658766e-29
## 3 3 7 0.821369314290254 1.70141157082175e-118
## 4 47 7 0.569899085935183 4.61958766180263e-42
## 5 59 7 0.549954043409585 1.15152278593831e-38
## … … … … …
## 43750 143964 18590 0.520703047216992 4.37554114264876e-34
## 43751 143977 18590 0.546301068655939 4.54936701954597e-38
## 43752 144004 18594 0.47934214168793 2.47516776950818e-28
## 43753 143999 18598 0.49092225993235 7.28579053551407e-30
## 43754 144005 18598 0.493389088498317 3.37643947034054e-30
## VarQATAC VarQRNA
##
## 1 0.948753202924817 0.793290683296597
## 2 0.253206396822421 0.445890005913661
## 3 0.253206396822421 0.42728885543788
## 4 0.948753202924817 0.42728885543788
## 5 0.916137185870328 0.42728885543788
## … … …
## 43750 0.629044018082203 0.960701037578625
## 43751 0.701518655084057 0.960701037578625
## 43752 0.814587977140318 0.802537497983979
## 43753 0.0392197709865356 0.901295629267244
## 43754 0.158948399058392 0.901295629267244

This peak-to-gene linkage DataFrame also has a metadata component containing a GRanges object of the relevant peaks. The indexes of idxATAC mentioned above apply to this GRanges object.

metadata(p2g)[[1]]

## GRanges object with 144009 ranges and 0 metadata columns: ## seqnames ranges strand
##
## [1] chr1 752499-752999 *
## [2] chr1 762651-763151 *
## [3] chr1 801006-801506 *
## [4] chr1 805039-805539 *
## [5] chr1 845325-845825 *
## … … … …
## [144005] chrX 154664540-154665040 *
## [144006] chrX 154807324-154807824 *
## [144007] chrX 154840785-154841285 *
## [144008] chrX 154842404-154842904 *
## [144009] chrX 154862017-154862517 *
## ——-
## seqinfo: 23 sequences from an unspecified genome; no seqlengths

If we set returnLoops = TRUE, then getPeak2GeneLinks() will return a loop track GRanges object that connects the peak and gene. As for co-accessibility, the start and end of the IRanges object represent the position of the peak and gene being linked. When resolution = 1, this links the center of the peak to the single-base TSS of the gene.

p2g <- getPeak2GeneLinks(
    ArchRProj = projHeme5,
    corCutOff = 0.45,
    resolution = 1,
    returnLoops = TRUE
)

p2g[[1]]

## GRanges object with 43695 ranges and 2 metadata columns: ## seqnames ranges strand | value
## |
## [1] chr1 762901-948847 * | 0.533350285896763
## [2] chr1 801256-895967 * | 0.487418258348982
## [3] chr1 801256-901877 * | 0.821369314290254
## [4] chr1 852367-955503 * | 0.487693663181145
## [5] chr1 894679-968584 * | 0.549552663393716
## … … … … . …
## [43691] chrX 153962451-153979858 * | 0.453720046871409
## [43692] chrX 153980218-153991031 * | 0.546301068655939
## [43693] chrX 154255064-154493795 * | 0.47934214168793
## [43694] chrX 154356425-154444701 * | 0.49092225993235
## [43695] chrX 154444701-154664790 * | 0.493389088498317
## FDR
##
## [1] 5.20045729958651e-36
## [2] 2.1460798658766e-29
## [3] 1.70141157082175e-118
## [4] 1.97244775926907e-29
## [5] 1.34094093110629e-38
## … …
## [43691] 3.78754241443136e-25
## [43692] 4.54936701954597e-38
## [43693] 2.47516776950818e-28
## [43694] 7.28579053551407e-30
## [43695] 3.37643947034054e-30
## ——-
## seqinfo: 23 sequences from an unspecified genome; no seqlengths

We can alternatively decrease the resolution of these links by setting resolution = 1000. This is primarily useful for plotting the links as a browser tracks because there are instances where many nearby peaks all link to the same gene and this can be difficult to visualize.

p2g <- getPeak2GeneLinks(
    ArchRProj = projHeme5,
    corCutOff = 0.45,
    resolution = 1000,
    returnLoops = TRUE
)

p2g[[1]]

## GRanges object with 42126 ranges and 2 metadata columns:
## seqnames ranges strand | value
## |
## [1] chr1 762500-948500 * | 0.533350285896763
## [2] chr1 801500-895500 * | 0.487418258348982
## [3] chr1 801500-901500 * | 0.821369314290254
## [4] chr1 852500-955500 * | 0.487693663181145
## [5] chr1 894500-968500 * | 0.549552663393716
## … … … … . …
## [42122] chrX 153962500-153979500 * | 0.453720046871409
## [42123] chrX 153980500-153991500 * | 0.546301068655939
## [42124] chrX 154255500-154493500 * | 0.47934214168793
## [42125] chrX 154356500-154444500 * | 0.49092225993235
## [42126] chrX 154444500-154664500 * | 0.493389088498317
## FDR
##
## [1] 5.20045729958651e-36
## [2] 2.1460798658766e-29
## [3] 1.70141157082175e-118
## [4] 1.97244775926907e-29
## [5] 1.34094093110629e-38
## … …
## [42122] 3.78754241443136e-25
## [42123] 4.54936701954597e-38
## [42124] 2.47516776950818e-28
## [42125] 7.28579053551407e-30
## [42126] 3.37643947034054e-30
## ——-
## seqinfo: 23 sequences from an unspecified genome; no seqlengths

Decreasing the resolution even further also decreases the total number of peak-to-gene links identified.

p2g <- getPeak2GeneLinks(
    ArchRProj = projHeme5,
    corCutOff = 0.45,
    resolution = 10000,
    returnLoops = TRUE
)

p2g[[1]]

## GRanges object with 33645 ranges and 2 metadata columns:
## seqnames ranges strand | value
## |
## [1] chr1 765000-945000 * | 0.533350285896763
## [2] chr1 805000-895000 * | 0.487418258348982
## [3] chr1 805000-905000 * | 0.821369314290254
## [4] chr1 855000-955000 * | 0.487693663181145
## [5] chr1 895000-965000 * | 0.549552663393716
## … … … … . …
## [33641] chrX 153965000-153975000 * | 0.453720046871409
## [33642] chrX 153985000-153995000 * | 0.546301068655939
## [33643] chrX 154255000-154495000 * | 0.47934214168793
## [33644] chrX 154355000-154445000 * | 0.49092225993235
## [33645] chrX 154445000-154665000 * | 0.493389088498317
## FDR
##
## [1] 5.20045729958651e-36
## [2] 2.1460798658766e-29
## [3] 1.70141157082175e-118
## [4] 1.97244775926907e-29
## [5] 1.34094093110629e-38
## … …
## [33641] 3.78754241443136e-25
## [33642] 4.54936701954597e-38
## [33643] 2.47516776950818e-28
## [33644] 7.28579053551407e-30
## [33645] 3.37643947034054e-30
## ——-
## seqinfo: 23 sequences from an unspecified genome; no seqlengths