使用SnapATAC分析單細胞ATAC-seq數據(三):Integrative Analysis of PBMC scATAC-seq and scRNA-seq

image.png

簡介

在本教程中,我們將對PBMC的兩個scATAC-seq數據集(5K和10K)和一個scRNA-seq數據集進行整合分析。這三個數據集均來自10X genomics測序平臺產生的數據,可以直接在10x官網下載使用。
具體來說,我們將主要執行以下分析內容:

  • Cell selection for PBMC 5k and 10k scATAC;
  • Randomly sample 10,000 cells as landmarks;
  • Unsupervised clustering of landmarks;
  • Project the remaining (query) cells onto the landmarks;
  • Supervised annotation of scATAC clusters using scRNA dataset;
  • Downstream analysis including peak calling, differential analysis, prediction of gene-enhancer pairing.

分析步驟

  • Step 0. Data download
  • Step 1. Barcode selection
  • Step 2. Add cell-by-bin matrix
  • Step 3. Matrix binarization
  • Step 4. Bin filtering
  • Step 5. Dimensionality reduction of landmarks
  • Step 6. Determine significant components
  • Step 7. Graph-based clustering
  • Step 8. Visualization
  • Step 9. scRNA-seq based annotation
  • Step 10. Create psudo multiomics cells
  • Step 11. Remove cells of low prediction score
  • Step 12. Gene expression projected into UMAP
  • Step 13. Identify peak
  • Step 14. Create a cell-by-peak matrix
  • Step 15. Identify differentially accessible regions
  • Step 16. Motif variability analysis
  • Step 17. De novo motif discovery
  • Step 18. Predict gene-enhancer pairs

Step 0. Data Download

開始分析之前,我們需要下載snaptools生成的snap文件和cellranger-atac產生的singlecell.csv文件。

# 下載所需的數據集和基因注釋信息
$ wget http://renlab.sdsc.edu/r3fang//share/github/PBMC_ATAC_RNA/atac_pbmc_5k_nextgem.snap
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_singlecell.csv
$ wget http://renlab.sdsc.edu/r3fang//share/github/PBMC_ATAC_RNA/atac_pbmc_10k_nextgem.snap
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/atac_pbmc_10k_nextgem_singlecell.csv
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/hg19.blacklist.bed.gz
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/gencode.v19.annotation.gene.bed

Step 1. Barcode selection

首先,我們根據以下兩個主要的標準來選擇高質量的barcodes:

  1. number of unique fragments;
  2. fragments in promoter ratio;
# 加載所需R包
> library(SnapATAC);
> snap.files = c(
    "atac_pbmc_5k_nextgem.snap", 
    "atac_pbmc_10k_nextgem.snap"
  );
> sample.names = c(
    "PBMC 5K",
    "PBMC 10K"
  );
> barcode.files = c(
    "atac_pbmc_5k_nextgem_singlecell.csv",
    "atac_pbmc_10k_nextgem_singlecell.csv"
  );

# 讀取snap文件
> x.sp.ls = lapply(seq(snap.files), function(i){
    createSnap(
        file=snap.files[i],
        sample=sample.names[i]
    );
  })
> names(x.sp.ls) = sample.names;

# 讀取barcode信息
> barcode.ls = lapply(seq(snap.files), function(i){
    barcodes = read.csv(
        barcode.files[i], 
        head=TRUE
    );
    # remove NO BAROCDE line
    barcodes = barcodes[2:nrow(barcodes),];
    barcodes$logUMI = log10(barcodes$passed_filters + 1);
    barcodes$promoter_ratio = (barcodes$promoter_region_fragments+1) / (barcodes$passed_filters + 1);
    barcodes
  })

# 質控指標數據可視化
> plots = lapply(seq(snap.files), function(i){
    p1 = ggplot(
        barcode.ls[[i]], 
        aes(x=logUMI, y=promoter_ratio)) + 
        geom_point(size=0.3, col="grey") +
        theme_classic() +
        ggtitle(sample.names[[i]]) +
        ylim(0, 1) + xlim(0, 6) + 
        labs(x = "log10(UMI)", y="promoter ratio")
        p1
    })
> plots
image.png
# 查看snap對象信息
> x.sp.ls
## $`PBMC 5K`
## number of barcodes: 20000
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
## 
## $`PBMC 10K`
## number of barcodes: 20000
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

# for both datasets, we identify usable barcodes using [3.5-5] for log10(UMI) and [0.4-0.8] for promoter ratio as cutoff.
# 設定質控閾值進行篩選過濾
> cutoff.logUMI.low = c(3.5, 3.5);
> cutoff.logUMI.high = c(5, 5);
> cutoff.FRIP.low = c(0.4, 0.4);
> cutoff.FRIP.high = c(0.8, 0.8);

# 根據過濾的閾值進行barcode的篩選
> barcode.ls = lapply(seq(snap.files), function(i){
    barcodes = barcode.ls[[i]];
    idx = which(
        barcodes$logUMI >= cutoff.logUMI.low[i] & 
        barcodes$logUMI <= cutoff.logUMI.high[i] & 
        barcodes$promoter_ratio >= cutoff.FRIP.low[i] &
        barcodes$promoter_ratio <= cutoff.FRIP.high[i]
    );
    barcodes[idx,]
  });

> x.sp.ls = lapply(seq(snap.files), function(i){
    barcodes = barcode.ls[[i]];
    x.sp = x.sp.ls[[i]];
    barcode.shared = intersect(x.sp@barcode, barcodes$barcode);
    x.sp = x.sp[match(barcode.shared, x.sp@barcode),];
    barcodes = barcodes[match(barcode.shared, barcodes$barcode),];
    x.sp@metaData = barcodes;
    x.sp
  })

> names(x.sp.ls) = sample.names;
> x.sp.ls
## $`PBMC 5K`
## number of barcodes: 4526
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
## 
## $`PBMC 10K`
## number of barcodes: 9039
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
# combine two snap object

# combine two snap object
# 使用Reduce函數將兩個snap對象進行合并
> x.sp = Reduce(snapRbind, x.sp.ls);
> x.sp@metaData["sample"] = x.sp@sample;

> x.sp
## number of barcodes: 13565
## number of bins: 0
## number of genes: 0
## number of peaks: 0

> table(x.sp@sample);
## PBMC 10K  PBMC 5K
##     9039     4526

Step 2. Add cell-by-bin matrix

接下來,我們使用addBmatToSnap函數生成5kb分辨率的cell-by-bin計數矩陣,并將其添加到snap對象中。該函數將自動從兩個snap文件中讀取cell-by-bin矩陣,并將連接后的矩陣添加到snap對象的bmat屬性中。

# 使用addBmatToSnap函數計算cell-by-bin計數矩陣
> x.sp = addBmatToSnap(x.sp, bin.size=5000);

Step 3. Matrix binarization

我們使用makeBinary函數將cell-by-bin計數矩陣轉換為二進制矩陣。在count矩陣中,某些項具有異常高的覆蓋率,這可能是由比對錯誤造成的。因此,我們將刪除計數矩陣中top 0.1%的項,并將其余非零的值轉換為1。

# 使用makeBinary函數將cell-by-bin計數矩陣轉換成二進制矩陣
> x.sp = makeBinary(x.sp, mat="bmat");

Step 4. Bin filtering

首先,我們過濾掉任何與ENCODE中blacklist區域重疊的bins,以防止潛在的artifacts。

> library(GenomicRanges);
# 讀取blacklist信息
> black_list = read.table("hg19.blacklist.bed.gz");
> black_list.gr = GRanges(
    black_list[,1], 
    IRanges(black_list[,2], black_list[,3])
  );

> idy = queryHits(
    findOverlaps(x.sp@feature, black_list.gr)
  );
> if(length(idy) > 0){
    x.sp = x.sp[,-idy, mat="bmat"];
  };
> x.sp
## number of barcodes: 13565
## number of bins: 624794
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

接下來,我們移除那些不需要的染色體信息。

> chr.exclude = seqlevels(x.sp@feature)[grep("random|chrM", seqlevels(x.sp@feature))];

> idy = grep(paste(chr.exclude, collapse="|"), x.sp@feature);
> if(length(idy) > 0){
    x.sp = x.sp[,-idy, mat="bmat"]
  };
> x.sp
## number of barcodes: 13565
## number of bins: 624297
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

第三,bins的覆蓋率大致是服從對數正態分布的。我們將與不變特征(如管家基因的啟動子)重疊的前5%的bins進行刪除 。

> bin.cov = log10(Matrix::colSums(x.sp@bmat)+1);
> hist(
    bin.cov[bin.cov > 0], 
    xlab="log10(bin cov)", 
    main="log10(Bin Cov)", 
    col="lightblue", 
    xlim=c(0, 5)
  );
> bin.cutoff = quantile(bin.cov[bin.cov > 0], 0.95);
> idy = which(bin.cov <= bin.cutoff & bin.cov > 0);
> x.sp = x.sp[, idy, mat="bmat"];
> x.sp
## number of barcodes: 13565
## number of bins: 534985
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
image.png

最后,我們將進一步刪除bin覆蓋率小于1000的任何細胞,這是因為盡管有些細胞可能含有較高的unique fragments片段,但是過濾后的bin覆蓋率卻很低。此步驟是可選的,但強烈建議這樣做。

> idx = which(Matrix::rowSums(x.sp@bmat) > 1000);
> x.sp = x.sp[idx,];
> x.sp
## number of barcodes: 13434
## number of bins: 534985
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

Step 5. Dimentionality reduction

SnapATAC采用diffusion maps算法進行數據降維,這是一種非線性降維的技術,它通過對數據執行random walk來發現低維的manifold,并且對噪音和擾動具有很強的魯棒性。

但是,diffusion maps算法的計算成本會隨著細胞數目的增加而呈現指數級增長的趨勢。為了克服這一局限,我們將Nystr?m方法(a sampling technique)和diffusion maps算法相結合,給出Nystr?m Landmark diffusion map來生成大規模數據集的低維嵌入。

Nystr?m landmark diffusion maps算法主要包括以下三個步驟:

  • sampling: sample a subset of K (K?N) cells from N total cells as “landmarks”. Instead of random sampling, here we adopted a density-based sampling approach developed in SCTransform to preserve the density distribution of the N original points;
  • embedding: compute a diffusion map embedding for K landmarks;
  • extension: project the remaining N-K cells onto the low-dimensional embedding as learned from the landmarks to create a joint embedding space for all cells.

在本示例中,我們將采樣10,000個細胞作為landmarks,并將余下的query細胞投射到嵌入landmarks的diffusion maps上。

> row.covs.dens <- density(
    x = x.sp@metaData[,"logUMI"], 
    bw = 'nrd', adjust = 1
  );
> sampling_prob <- 1 / (approx(x = row.covs.dens$x, y = row.covs.dens$y, xout = x.sp@metaData[,"logUMI"])$y + .Machine$double.eps);

> set.seed(1);
> idx.landmark.ds <- sort(sample(x = seq(nrow(x.sp)), size = 10000, prob = sampling_prob));

將x.sp對象分割為landmark(x.landmark.sp)細胞和query(x.query.sp)細胞。

> x.landmark.sp = x.sp[idx.landmark.ds,];
> x.query.sp = x.sp[-idx.landmark.ds,];

Run diffusion maps on the landmark cells.
使用landmark細胞進行diffusion maps降維處理

> x.landmark.sp = runDiffusionMaps(
    obj= x.landmark.sp,
    input.mat="bmat", 
    num.eigs=50
  );
> x.landmark.sp@metaData$landmark = 1;

Porject query cells to landmarks.
將query細胞映射到landmarks上

> x.query.sp = runDiffusionMapsExtension(
    obj1=x.landmark.sp, 
    obj2=x.query.sp,
    input.mat="bmat"
  );
> x.query.sp@metaData$landmark = 0;

Combine landmark and query cells.
合并landmark和query細胞
Note: To merge snap objects, all the matrix (bmat, gmat, pmat) and metaData must be of the same number of columns between snap objects.

> x.sp = snapRbind(x.landmark.sp, x.query.sp);
> x.sp = x.sp[order(x.sp@metaData[,"sample"])]; #IMPORTANT

Step 6. Determine significant components

接下來,我們將確定特征向量(eigen-vectors)的數目,以便用于后續的分析。在下面的例子中,我們選擇前15個特征向量。

> plotDimReductPW(
    obj=x.sp, 
    eigs.dims=1:50,
    point.size=0.3,
    point.color="grey",
    point.shape=19,
    point.alpha=0.6,
    down.sample=5000,
    pdf.file.name=NULL, 
    pdf.height=7, 
    pdf.width=7
  );
image.png

Step 7. Graph-based clustering

接下來,我們使用所選的特征向量成分,來構造K近鄰(KNN)聚類圖。其中,每個節點是一個細胞,根據歐氏距離來確定每個細胞的k個近鄰細胞。

> x.sp = runKNN(
    obj=x.sp,
    eigs.dims=1:20,
    k=15
  );

> library(leiden);
> x.sp=runCluster(
    obj=x.sp,
    tmp.folder=tempdir(),
    louvain.lib="leiden",
    seed.use=10,
    resolution=0.7
  );

Step 8. Visualization

SnapATAC可以使用tSNE(FI-tsne)和UMAP方法對降維聚類后的細胞進行可視化的展示和探索。在本示例中,我們使用UMAP方法進行展示。

> library(umap);
> x.sp = runViz(
    obj=x.sp, 
    tmp.folder=tempdir(),
    dims=2,
    eigs.dims=1:20, 
    method="umap",
    seed.use=10
  );

> par(mfrow = c(2, 2));
> plotViz(
    obj= x.sp,
    method="umap", 
    main="Cluster",
    point.color=x.sp@cluster, 
    point.size=0.2, 
    point.shape=19, 
    text.add=TRUE,
    text.size=1,
    text.color="black",
    down.sample=10000,
    legend.add=FALSE
  );
> plotFeatureSingle(
    obj=x.sp,
    feature.value=x.sp@metaData[,"logUMI"],
    method="umap", 
    main="Read Depth",
    point.size=0.2, 
    point.shape=19, 
    down.sample=10000,
    quantiles=c(0.01, 0.99)
  );
> plotViz(
    obj= x.sp,
    method="umap", 
    main="Sample",
    point.size=0.2, 
    point.shape=19, 
    point.color=x.sp@sample, 
    text.add=FALSE,
    text.size=1.5,
    text.color="black",
    down.sample=10000,
    legend.add=TRUE
    );
> plotViz(
    obj= x.sp,
    method="umap", 
    main="Landmark",
    point.size=0.2, 
    point.shape=19, 
    point.color=x.sp@metaData[,"landmark"], 
    text.add=FALSE,
    text.size=1.5,
    text.color="black",
    down.sample=10000,
    legend.add=TRUE
  );
image.png

Step 9. scRNA-seq based annotation

在本示例中,我們將使用相應的scRNA-seq數據集來對scATAC-seq數據的細胞類群進行注釋。我們可以通過(https://www.dropbox.com/s/3f3p5nxrn5b3y4y/pbmc_10k_v3.rds?dl=1)地址下載所需的10X PBMC scRNA-seq(pbmc_10k_v3.rds)的Seurat對象。

> library(Seurat);
> pbmc.rna = readRDS("pbmc_10k_v3.rds");
> pbmc.rna$tech = "rna";
> variable.genes = VariableFeatures(object = pbmc.rna);

> genes.df = read.table("gencode.v19.annotation.gene.bed");
> genes.gr = GRanges(genes.df[,1], IRanges(genes.df[,2], genes.df[,3]), name=genes.df[,4]);
> genes.sel.gr = genes.gr[which(genes.gr$name %in% variable.genes)];

## reload the bmat, this is optional but highly recommanded
> x.sp = addBmatToSnap(x.sp);
> x.sp = createGmatFromMat(
    obj=x.sp, 
    input.mat="bmat",
    genes=genes.sel.gr,
    do.par=TRUE,
    num.cores=10
  );

接下來,我們將snap對象轉換為Seurat對象,用于后續的數據整合。

# 使用snapToSeurat函數將snap對象轉換為Seurat對象
> pbmc.atac <- snapToSeurat(
    obj=x.sp, 
    eigs.dims=1:20, 
    norm=TRUE,
    scale=TRUE
  );

# 識別整合的Anchors信息
> transfer.anchors <- FindTransferAnchors(
    reference = pbmc.rna, 
    query = pbmc.atac, 
    features = variable.genes, 
    reference.assay = "RNA", 
    query.assay = "ACTIVITY", 
    reduction = "cca"
  );

# 使用TransferData函數進行數據映射整合
> celltype.predictions <- TransferData(
    anchorset = transfer.anchors, 
    refdata = pbmc.rna$celltype,
    weight.reduction = pbmc.atac[["SnapATAC"]],
    dims = 1:20
  );
> x.sp@metaData$predicted.id = celltype.predictions$predicted.id;
> x.sp@metaData$predict.max.score = apply(celltype.predictions[,-1], 1, max);
> x.sp@cluster = as.factor(x.sp@metaData$predicted.id);

Step 10. Create psudo multiomics cells

現在,snap對象x.sp中包含了的每個細胞的染色質可及性@bmat和基因表達@gmat的信息。

> refdata <- GetAssayData(
    object = pbmc.rna, 
    assay = "RNA", 
    slot = "data"
  );
> imputation <- TransferData(
    anchorset = transfer.anchors, 
    refdata = refdata, 
    weight.reduction = pbmc.atac[["SnapATAC"]], 
    dims = 1:20
  );
> x.sp@gmat = t(imputation@data);
> rm(imputation); # free memory
> rm(refdata);    # free memory
> rm(pbmc.rna);   # free memory
> rm(pbmc.atac); # free memory

Step 11. Remove cells of low prediction score

> hist(
    x.sp@metaData$predict.max.score, 
    xlab="prediction score", 
    col="lightblue", 
    xlim=c(0, 1),
    main="PBMC 10X"
  );
> abline(v=0.5, col="red", lwd=2, lty=2);

> table(x.sp@metaData$predict.max.score > 0.5);
## FALSE  TRUE
##  331 13103

> x.sp = x.sp[x.sp@metaData$predict.max.score > 0.5,];
> x.sp
## number of barcodes: 13045
## number of bins: 627478
## number of genes: 19089
## number of peaks: 0

> plotViz(
    obj=x.sp,
    method="umap", 
    main="PBMC 10X",
    point.color=x.sp@metaData[,"predicted.id"], 
    point.size=0.5, 
    point.shape=19, 
    text.add=TRUE,
    text.size=1,
    text.color="black",
    down.sample=10000,
    legend.add=FALSE
  );
image.png

Step 12. Gene expression projected onto UMAP

接下來,我們使用UMAP方法對一些marker基因的表達水平進行可視化展示。

> marker.genes = c(
    "IL32", "LTB", "CD3D",
    "IL7R", "LDHB", "FCGR3A", 
    "CD68", "MS4A1", "GNLY", 
    "CD3E", "CD14", "CD14", 
    "FCGR3A", "LYZ", "PPBP", 
    "CD8A", "PPBP", "CST3", 
    "NKG7", "MS4A7", "MS4A1", 
    "CD8A"
  );

> par(mfrow = c(3, 3));
> for(i in 1:9){
    j = which(colnames(x.sp@gmat) == marker.genes[i])
    plotFeatureSingle(
        obj=x.sp,
        feature.value=x.sp@gmat[,j],
        method="umap", 
        main=marker.genes[i],
        point.size=0.1, 
        point.shape=19, 
        down.sample=10000,
        quantiles=c(0.01, 0.99)
 )};
image.png

Step 13. Identify peaks

接下來,我們將每個cluster群中的reads進行聚合,創建用于peak calling和可視化的集成track。執行完peak calling后,將生成一個.narrowPeak的文件,該文件中包含了識別出的peaks信息,和一個.bedgraph的文件,可以用于可視化展示。為了獲得最robust的結果,我們不建議對細胞數目小于100的cluster群執行此步驟。

> system("which snaptools");
/home/r3fang/anaconda2/bin/snaptools
> system("which macs2");
/home/r3fang/anaconda2/bin/macs2

# 使用runMACS函數調用macs2進行peak calling
> peaks = runMACS(
    obj=x.sp[which(x.sp@cluster=="CD4 Naive"),], 
    output.prefix="PBMC.CD4_Naive",
    path.to.snaptools="/home/r3fang/anaconda2/bin/snaptools",
    path.to.macs="/home/r3fang/anaconda2/bin/macs2",
    gsize="hs", # mm, hs, etc
    buffer.size=500, 
    num.cores=10,
    macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR",
    tmp.folder=tempdir()
    );

為了對所有的cluster群執行peak calling,我們提供了一個簡便的腳本來實現該步驟。

# call peaks for all cluster with more than 100 cells
# 選出那些細胞數目大于100的cluster群
> clusters.sel = names(table(x.sp@cluster))[which(table(x.sp@cluster) > 100)];

# 批量對不同的cluster進行peak calling
> peaks.ls = mclapply(seq(clusters.sel), function(i){
    print(clusters.sel[i]);
    peaks = runMACS(
        obj=x.sp[which(x.sp@cluster==clusters.sel[i]),], 
        output.prefix=paste0("PBMC.", gsub(" ", "_", clusters.sel)[i]),
        path.to.snaptools="/home/r3fang/anaconda2/bin/snaptools",
        path.to.macs="/home/r3fang/anaconda2/bin/macs2",
        gsize="hs", # mm, hs, etc
        buffer.size=500, 
        num.cores=1,
        macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR",
        tmp.folder=tempdir()
   );
    peaks
    }, mc.cores=5);

> peaks.names = system("ls | grep narrowPeak", intern=TRUE);

> peak.gr.ls = lapply(peaks.names, function(x){
    peak.df = read.table(x)
    GRanges(peak.df[,1], IRanges(peak.df[,2], peak.df[,3]))
  })

# 合并peak信息
> peak.gr = reduce(Reduce(c, peak.gr.ls));

我們可以將每個cluster群生成的bdg文件導入到IGV或其他基因組瀏覽器(如UW genome browser)進行可視化的展示,下面是來自UW genome browser的FOXJ2基因及其側翼區域的screenshot。


image.png

Step 14. Create a cell-by-peak matrix

接下來,我們使用合并后的peaks作為參考,來創建一個cell-by-peak計數矩陣,并將其添加到snap對象中。
首先,我們將合并后的peaks信息寫入到peaks.combined.bed文件中

> peaks.df = as.data.frame(peak.gr)[,1:3];
> write.table(peaks.df,file = "peaks.combined.bed",append=FALSE,
        quote= FALSE,sep="\t", eol = "\n", na = "NA", dec = ".", 
        row.names = FALSE, col.names = FALSE, qmethod = c("escape", "double"),
        fileEncoding = "")

接下來,我們創建一個cell-by-peak計數矩陣,并將其添加到snap對象中。這一步可能需要一段時間。

$ snaptools snap-add-pmat \
    --snap-file atac_pbmc_10k_nextgem.snap \
    --peak-file peaks.combined.bed &
$ snaptools snap-add-pmat \
    --snap-file atac_pbmc_5k_nextgem.snap \
    --peak-file peaks.combined.bed  

然后,我們將計算好的cell-by-peak計數矩陣添加到已有的snap對象中。

> x.sp = addPmatToSnap(x.sp);

Step 15. Identify differentially accessible peaks

對于一組給定的細胞Ci,我們首先在diffusion component空間中尋找鄰近的細胞Cj (|Ci|=|Cj|)作為“background”細胞進行比較。如果Ci細胞的數目占總細胞數目的一半以上,則使用剩余的細胞作為local background。接下來,我們將Ci和Cj細胞進行聚合,來創建兩個raw-count向量,即Vci和Vcj。然后,我們使用R包edgeR (v3.18.1)的精確檢驗(exact test)來對Vci和Vcj進行差異分析,其中,對于小鼠設置BCV=0.1,而人類設置BCV= 0.4。最后,再使用Benjamini-Hochberg多重檢驗校正方法,來調整p-value值為錯誤發現率(FDR)值。對于FDR值小于0.05的peaks,被選為顯著的DARs。

我們發現這種方法也存在一定的局限性,特別是對于那些細胞數目較少的cluster群,統計的有效性可能不夠強大。對于那些未能識別出顯著差異peaks的cluster群,我們將根據富集p-value值對元素進行排序,并挑選出最具代表性的2,000個peaks進行后續的motif分析。

> DARs = findDAR(
    obj=x.sp,
    input.mat="pmat",
    cluster.pos="CD14+ Monocytes",
    cluster.neg.method="knn",
    test.method="exactTest",
    bcv=0.4, #0.4 for human, 0.1 for mouse
    seed.use=10
  );
> DARs$FDR = p.adjust(DARs$PValue, method="BH");
> idy = which(DARs$FDR < 5e-2 & DARs$logFC > 0);

> par(mfrow = c(1, 2));
> plot(DARs$logCPM, DARs$logFC, 
    pch=19, cex=0.1, col="grey", 
    ylab="logFC", xlab="logCPM",
    main="CD14+ Monocytes"
  );
> points(DARs$logCPM[idy], 
    DARs$logFC[idy], 
    pch=19, 
    cex=0.5, 
    col="red"
  );
> abline(h = 0, lwd=1, lty=2);

> covs = Matrix::rowSums(x.sp@pmat);
> vals = Matrix::rowSums(x.sp@pmat[,idy]) / covs;
> vals.zscore = (vals - mean(vals)) / sd(vals);

> plotFeatureSingle(
    obj=x.sp,
    feature.value=vals.zscore,
    method="umap", 
    main="CD14+ Monocytes",
    point.size=0.1, 
    point.shape=19, 
    down.sample=5000,
    quantiles=c(0.01, 0.99)
  );

Step 16. Motif variability analysis

SnapATAC可以調用chromVAR(Schep et al)程序進行motif可變性分析。

# 加載所需的R包
> library(chromVAR);
> library(motifmatchr);
> library(SummarizedExperiment);
> library(BSgenome.Hsapiens.UCSC.hg19);

> x.sp = makeBinary(x.sp, "pmat");

# 使用runChromVAR函數調用ChromVAR進行motif可變性分析
> x.sp@mmat = runChromVAR(
    obj=x.sp,
    input.mat="pmat",
    genome=BSgenome.Hsapiens.UCSC.hg19,
    min.count=10,
    species="Homo sapiens"
  );
> x.sp;
## number of barcodes: 13103
## number of bins: 627478
## number of genes: 19089
## number of peaks: 157750
## number of motifs: 271

> motif_i = "MA0071.1_RORA";
> dat = data.frame(x=x.sp@metaData$predicted.id, y=x.sp@mmat[,motif_i]);
> p <- ggplot(dat, aes(x=x, y=y, fill=x)) + 
    theme_classic() +
    geom_violin() + 
    xlab("cluster") +
    ylab("motif enrichment") + 
    ggtitle("MA0071.1_RORA") +
    theme(
          plot.margin = margin(5,1,5,1, "cm"),
          axis.text.x = element_text(angle = 90, hjust = 1),
          axis.ticks.x=element_blank(),
          legend.position = "none"
   );
image.png

Step 17. De novo motif discovery

SnapATAC還可以調用homer用于對識別出的差異可及性區域(DARs)進行motif的識別和富集分析.

# 查看findMotifsGenome.pl程序安裝的路徑
> system("which findMotifsGenome.pl");
/projects/ps-renlab/r3fang/public_html/softwares/homer/bin/findMotifsGenome.pl

# 使用findDAR函數鑒定DAR區域
> DARs = findDAR(
    obj=x.sp,
    input.mat="pmat",
    cluster.pos="Double negative T cell",
    cluster.neg.method="knn",
    test.method="exactTest",
    bcv=0.4, #0.4 for human, 0.1 for mouse
    seed.use=10
  );
> DARs$FDR = p.adjust(DARs$PValue, method="BH");
> idy = which(DARs$FDR < 5e-2 & DARs$logFC > 0);

# 使用runHomer函數調用homer進行de novo motif discovery
> motifs = runHomer(
    x.sp[,idy,"pmat"], 
    mat = "pmat",
    path.to.homer = "/projects/ps-renlab/r3fang/public_html/softwares/homer/bin/findMotifsGenome.pl",
    result.dir = "./homer/DoubleNegativeTcell",
    num.cores=5,
    genome = 'hg19',
    motif.length = 10,
    scan.size = 300,
    optimize.count = 2,
    background = 'automatic',
    local.background = FALSE,
    only.known = TRUE,
    only.denovo = FALSE,
    fdr.num = 5,
    cache = 100,
    overwrite = TRUE,
    keep.minimal = FALSE
    );
image.png

Step 18. Predict gene-enhancer pairs

最后,我們還可以使用一種“pseudo”細胞的方法,根據單個細胞中基因的表達和遠端調控元件的染色質可及性的關系,將遠端調控元件連接到目標基因上。對于給定的marker基因,我們首先識別出marker基因側翼區域內的peak。對于每個側翼的peak,我們使用基因表達作為輸入變量進行邏輯回歸來預測binarized的染色質狀態。產生的結果估計了染色質可及性與基因表達之間聯系的重要性。

> TSS.loci = GRanges("chr12", IRanges(8219067, 8219068));

> pairs = predictGenePeakPair(
    x.sp, 
    input.mat="pmat",
    gene.name="C3AR1", 
    gene.loci=resize(TSS.loci, width=500000, fix="center"),
    do.par=FALSE
    );

# convert the pair to genome browser arc plot format
> pairs.df = as.data.frame(pairs);
> pairs.df = data.frame(
    chr1=pairs.df[,"seqnames"],
    start1=pairs.df[,"start"],
    end1=pairs.df[,"end"],
    chr2="chr2",
    start2=8219067,
    end2=8219068,
    Pval=pairs.df[,"logPval"]
    );
> head(pairs.df)
##    chr1  start1    end1 chr2  start2    end2       Pval
## 1 chr12 7984734 7985229 chr2 8219067 8219068 14.6075918
## 2 chr12 7987561 7988085 chr2 8219067 8219068  5.6718381
## 3 chr12 7989776 7990567 chr2 8219067 8219068 24.2564608
## 4 chr12 7996454 7996667 chr2 8219067 8219068  0.6411017
## 5 chr12 8000059 8000667 chr2 8219067 8219068  2.0324922
## 6 chr12 8012404 8013040 chr2 8219067 8219068  0.0000000
image.png

參考來源:https://gitee.com/booew/SnapATAC/blob/master/examples/10X_PBMC_15K/README.md

最后編輯于
?著作權歸作者所有,轉載或內容合作請聯系作者
平臺聲明:文章內容(如有圖片或視頻亦包括在內)由作者上傳并發布,文章內容僅代表作者本人觀點,簡書系信息發布平臺,僅提供信息存儲服務。
  • 序言:七十年代末,一起剝皮案震驚了整個濱河市,隨后出現的幾起案子,更是在濱河造成了極大的恐慌,老刑警劉巖,帶你破解...
    沈念sama閱讀 228,238評論 6 531
  • 序言:濱河連續發生了三起死亡事件,死亡現場離奇詭異,居然都是意外死亡,警方通過查閱死者的電腦和手機,發現死者居然都...
    沈念sama閱讀 98,430評論 3 415
  • 文/潘曉璐 我一進店門,熙熙樓的掌柜王于貴愁眉苦臉地迎上來,“玉大人,你說我怎么就攤上這事。” “怎么了?”我有些...
    開封第一講書人閱讀 176,134評論 0 373
  • 文/不壞的土叔 我叫張陵,是天一觀的道長。 經常有香客問我,道長,這世上最難降的妖魔是什么? 我笑而不...
    開封第一講書人閱讀 62,893評論 1 309
  • 正文 為了忘掉前任,我火速辦了婚禮,結果婚禮上,老公的妹妹穿的比我還像新娘。我一直安慰自己,他們只是感情好,可當我...
    茶點故事閱讀 71,653評論 6 408
  • 文/花漫 我一把揭開白布。 她就那樣靜靜地躺著,像睡著了一般。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發上,一...
    開封第一講書人閱讀 55,136評論 1 323
  • 那天,我揣著相機與錄音,去河邊找鬼。 笑死,一個胖子當著我的面吹牛,可吹牛的內容都是我干的。 我是一名探鬼主播,決...
    沈念sama閱讀 43,212評論 3 441
  • 文/蒼蘭香墨 我猛地睜開眼,長吁一口氣:“原來是場噩夢啊……” “哼!你這毒婦竟也來了?” 一聲冷哼從身側響起,我...
    開封第一講書人閱讀 42,372評論 0 288
  • 序言:老撾萬榮一對情侶失蹤,失蹤者是張志新(化名)和其女友劉穎,沒想到半個月后,有當地人在樹林里發現了一具尸體,經...
    沈念sama閱讀 48,888評論 1 334
  • 正文 獨居荒郊野嶺守林人離奇死亡,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內容為張勛視角 年9月15日...
    茶點故事閱讀 40,738評論 3 354
  • 正文 我和宋清朗相戀三年,在試婚紗的時候發現自己被綠了。 大學時的朋友給我發了我未婚夫和他白月光在一起吃飯的照片。...
    茶點故事閱讀 42,939評論 1 369
  • 序言:一個原本活蹦亂跳的男人離奇死亡,死狀恐怖,靈堂內的尸體忽然破棺而出,到底是詐尸還是另有隱情,我是刑警寧澤,帶...
    沈念sama閱讀 38,482評論 5 359
  • 正文 年R本政府宣布,位于F島的核電站,受9級特大地震影響,放射性物質發生泄漏。R本人自食惡果不足惜,卻給世界環境...
    茶點故事閱讀 44,179評論 3 347
  • 文/蒙蒙 一、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧,春花似錦、人聲如沸。這莊子的主人今日做“春日...
    開封第一講書人閱讀 34,588評論 0 26
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽。三九已至,卻和暖如春,著一層夾襖步出監牢的瞬間,已是汗流浹背。 一陣腳步聲響...
    開封第一講書人閱讀 35,829評論 1 283
  • 我被黑心中介騙來泰國打工, 沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留,地道東北人。 一個月前我還...
    沈念sama閱讀 51,610評論 3 391
  • 正文 我出身青樓,卻偏偏與公主長得像,于是被迫代替她去往敵國和親。 傳聞我的和親對象是個殘疾皇子,可洞房花燭夜當晚...
    茶點故事閱讀 47,916評論 2 372