久久久久亚州av无吗a片,青草全福视在线,三上悠亚公侵犯344在线观看

image.png

簡介

在本教程中，我們將對PBMC的兩個scATAC-seq數據集(5K和10K)和一個scRNA-seq數據集進行整合分析。這三個數據集均來自10X genomics測序平臺產生的數據，可以直接在10x官網下載使用。
具體來說，我們將主要執行以下分析內容：

Cell selection for PBMC 5k and 10k scATAC;
Randomly sample 10,000 cells as landmarks;
Unsupervised clustering of landmarks;
Project the remaining (query) cells onto the landmarks;
Supervised annotation of scATAC clusters using scRNA dataset;
Downstream analysis including peak calling, differential analysis, prediction of gene-enhancer pairing.

分析步驟

Step 0. Data download
Step 1. Barcode selection
Step 2. Add cell-by-bin matrix
Step 3. Matrix binarization
Step 4. Bin filtering
Step 5. Dimensionality reduction of landmarks
Step 6. Determine significant components
Step 7. Graph-based clustering
Step 8. Visualization
Step 9. scRNA-seq based annotation
Step 10. Create psudo multiomics cells
Step 11. Remove cells of low prediction score
Step 12. Gene expression projected into UMAP
Step 13. Identify peak
Step 14. Create a cell-by-peak matrix
Step 15. Identify differentially accessible regions
Step 16. Motif variability analysis
Step 17. De novo motif discovery
Step 18. Predict gene-enhancer pairs

Step 0. Data Download

開始分析之前，我們需要下載snaptools生成的snap文件和cellranger-atac產生的singlecell.csv文件。

# 下載所需的數據集和基因注釋信息
$ wget http://renlab.sdsc.edu/r3fang//share/github/PBMC_ATAC_RNA/atac_pbmc_5k_nextgem.snap
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_singlecell.csv
$ wget http://renlab.sdsc.edu/r3fang//share/github/PBMC_ATAC_RNA/atac_pbmc_10k_nextgem.snap
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/atac_pbmc_10k_nextgem_singlecell.csv
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/hg19.blacklist.bed.gz
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/gencode.v19.annotation.gene.bed

Step 1. Barcode selection

首先，我們根據以下兩個主要的標準來選擇高質量的barcodes：

number of unique fragments;
fragments in promoter ratio;

# 加載所需R包
> library(SnapATAC);
> snap.files = c(
    "atac_pbmc_5k_nextgem.snap", 
    "atac_pbmc_10k_nextgem.snap"
  );
> sample.names = c(
    "PBMC 5K",
    "PBMC 10K"
  );
> barcode.files = c(
    "atac_pbmc_5k_nextgem_singlecell.csv",
    "atac_pbmc_10k_nextgem_singlecell.csv"
  );

# 讀取snap文件
> x.sp.ls = lapply(seq(snap.files), function(i){
    createSnap(
        file=snap.files[i],
        sample=sample.names[i]
    );
  })
> names(x.sp.ls) = sample.names;

# 讀取barcode信息
> barcode.ls = lapply(seq(snap.files), function(i){
    barcodes = read.csv(
        barcode.files[i], 
        head=TRUE
    );
    # remove NO BAROCDE line
    barcodes = barcodes[2:nrow(barcodes),];
    barcodes$logUMI = log10(barcodes$passed_filters + 1);
    barcodes$promoter_ratio = (barcodes$promoter_region_fragments+1) / (barcodes$passed_filters + 1);
    barcodes
  })

# 質控指標數據可視化
> plots = lapply(seq(snap.files), function(i){
    p1 = ggplot(
        barcode.ls[[i]], 
        aes(x=logUMI, y=promoter_ratio)) + 
        geom_point(size=0.3, col="grey") +
        theme_classic() +
        ggtitle(sample.names[[i]]) +
        ylim(0, 1) + xlim(0, 6) + 
        labs(x = "log10(UMI)", y="promoter ratio")
        p1
    })
> plots

image.png

# 查看snap對象信息
> x.sp.ls
## $`PBMC 5K`
## number of barcodes: 20000
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
## 
## $`PBMC 10K`
## number of barcodes: 20000
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

# for both datasets, we identify usable barcodes using [3.5-5] for log10(UMI) and [0.4-0.8] for promoter ratio as cutoff.
# 設定質控閾值進行篩選過濾
> cutoff.logUMI.low = c(3.5, 3.5);
> cutoff.logUMI.high = c(5, 5);
> cutoff.FRIP.low = c(0.4, 0.4);
> cutoff.FRIP.high = c(0.8, 0.8);

# 根據過濾的閾值進行barcode的篩選
> barcode.ls = lapply(seq(snap.files), function(i){
    barcodes = barcode.ls[[i]];
    idx = which(
        barcodes$logUMI >= cutoff.logUMI.low[i] & 
        barcodes$logUMI <= cutoff.logUMI.high[i] & 
        barcodes$promoter_ratio >= cutoff.FRIP.low[i] &
        barcodes$promoter_ratio <= cutoff.FRIP.high[i]
    );
    barcodes[idx,]
  });

> x.sp.ls = lapply(seq(snap.files), function(i){
    barcodes = barcode.ls[[i]];
    x.sp = x.sp.ls[[i]];
    barcode.shared = intersect(x.sp@barcode, barcodes$barcode);
    x.sp = x.sp[match(barcode.shared, x.sp@barcode),];
    barcodes = barcodes[match(barcode.shared, barcodes$barcode),];
    x.sp@metaData = barcodes;
    x.sp
  })

> names(x.sp.ls) = sample.names;
> x.sp.ls
## $`PBMC 5K`
## number of barcodes: 4526
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
## 
## $`PBMC 10K`
## number of barcodes: 9039
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
# combine two snap object

# combine two snap object
# 使用Reduce函數將兩個snap對象進行合并
> x.sp = Reduce(snapRbind, x.sp.ls);
> x.sp@metaData["sample"] = x.sp@sample;

> x.sp
## number of barcodes: 13565
## number of bins: 0
## number of genes: 0
## number of peaks: 0

> table(x.sp@sample);
## PBMC 10K  PBMC 5K
##     9039     4526

Step 2. Add cell-by-bin matrix

接下來，我們使用addBmatToSnap函數生成5kb分辨率的cell-by-bin計數矩陣，并將其添加到snap對象中。該函數將自動從兩個snap文件中讀取cell-by-bin矩陣，并將連接后的矩陣添加到snap對象的bmat屬性中。

# 使用addBmatToSnap函數計算cell-by-bin計數矩陣
> x.sp = addBmatToSnap(x.sp, bin.size=5000);

Step 3. Matrix binarization

我們使用makeBinary函數將cell-by-bin計數矩陣轉換為二進制矩陣。在count矩陣中，某些項具有異常高的覆蓋率，這可能是由比對錯誤造成的。因此，我們將刪除計數矩陣中top 0.1%的項，并將其余非零的值轉換為1。

# 使用makeBinary函數將cell-by-bin計數矩陣轉換成二進制矩陣
> x.sp = makeBinary(x.sp, mat="bmat");

Step 4. Bin filtering

首先，我們過濾掉任何與ENCODE中blacklist區域重疊的bins，以防止潛在的artifacts。

> library(GenomicRanges);
# 讀取blacklist信息
> black_list = read.table("hg19.blacklist.bed.gz");
> black_list.gr = GRanges(
    black_list[,1], 
    IRanges(black_list[,2], black_list[,3])
  );

> idy = queryHits(
    findOverlaps(x.sp@feature, black_list.gr)
  );
> if(length(idy) > 0){
    x.sp = x.sp[,-idy, mat="bmat"];
  };
> x.sp
## number of barcodes: 13565
## number of bins: 624794
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

接下來，我們移除那些不需要的染色體信息。

> chr.exclude = seqlevels(x.sp@feature)[grep("random|chrM", seqlevels(x.sp@feature))];

> idy = grep(paste(chr.exclude, collapse="|"), x.sp@feature);
> if(length(idy) > 0){
    x.sp = x.sp[,-idy, mat="bmat"]
  };
> x.sp
## number of barcodes: 13565
## number of bins: 624297
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

第三，bins的覆蓋率大致是服從對數正態分布的。我們將與不變特征(如管家基因的啟動子)重疊的前5%的bins進行刪除。

> bin.cov = log10(Matrix::colSums(x.sp@bmat)+1);
> hist(
    bin.cov[bin.cov > 0], 
    xlab="log10(bin cov)", 
    main="log10(Bin Cov)", 
    col="lightblue", 
    xlim=c(0, 5)
  );
> bin.cutoff = quantile(bin.cov[bin.cov > 0], 0.95);
> idy = which(bin.cov <= bin.cutoff & bin.cov > 0);
> x.sp = x.sp[, idy, mat="bmat"];
> x.sp
## number of barcodes: 13565
## number of bins: 534985
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

image.png

最后，我們將進一步刪除bin覆蓋率小于1000的任何細胞，這是因為盡管有些細胞可能含有較高的unique fragments片段，但是過濾后的bin覆蓋率卻很低。此步驟是可選的，但強烈建議這樣做。

> idx = which(Matrix::rowSums(x.sp@bmat) > 1000);
> x.sp = x.sp[idx,];
> x.sp
## number of barcodes: 13434
## number of bins: 534985
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

Step 5. Dimentionality reduction

SnapATAC采用diffusion maps算法進行數據降維，這是一種非線性降維的技術，它通過對數據執行random walk來發現低維的manifold，并且對噪音和擾動具有很強的魯棒性。

但是，diffusion maps算法的計算成本會隨著細胞數目的增加而呈現指數級增長的趨勢。為了克服這一局限，我們將Nystr?m方法(a sampling technique)和diffusion maps算法相結合，給出Nystr?m Landmark diffusion map來生成大規模數據集的低維嵌入。

Nystr?m landmark diffusion maps算法主要包括以下三個步驟：

sampling: sample a subset of K (K?N) cells from N total cells as “landmarks”. Instead of random sampling, here we adopted a density-based sampling approach developed in SCTransform to preserve the density distribution of the N original points;
embedding: compute a diffusion map embedding for K landmarks;
extension: project the remaining N-K cells onto the low-dimensional embedding as learned from the landmarks to create a joint embedding space for all cells.

在本示例中，我們將采樣10,000個細胞作為landmarks，并將余下的query細胞投射到嵌入landmarks的diffusion maps上。

> row.covs.dens <- density(
    x = x.sp@metaData[,"logUMI"], 
    bw = 'nrd', adjust = 1
  );
> sampling_prob <- 1 / (approx(x = row.covs.dens$x, y = row.covs.dens$y, xout = x.sp@metaData[,"logUMI"])$y + .Machine$double.eps);

> set.seed(1);
> idx.landmark.ds <- sort(sample(x = seq(nrow(x.sp)), size = 10000, prob = sampling_prob));

將x.sp對象分割為landmark(x.landmark.sp)細胞和query(x.query.sp)細胞。

> x.landmark.sp = x.sp[idx.landmark.ds,];
> x.query.sp = x.sp[-idx.landmark.ds,];

Run diffusion maps on the landmark cells.
使用landmark細胞進行diffusion maps降維處理

> x.landmark.sp = runDiffusionMaps(
    obj= x.landmark.sp,
    input.mat="bmat", 
    num.eigs=50
  );
> x.landmark.sp@metaData$landmark = 1;

Porject query cells to landmarks.
將query細胞映射到landmarks上

> x.query.sp = runDiffusionMapsExtension(
    obj1=x.landmark.sp, 
    obj2=x.query.sp,
    input.mat="bmat"
  );
> x.query.sp@metaData$landmark = 0;

Combine landmark and query cells.
合并landmark和query細胞
Note: To merge snap objects, all the matrix (bmat, gmat, pmat) and metaData must be of the same number of columns between snap objects.

> x.sp = snapRbind(x.landmark.sp, x.query.sp);
> x.sp = x.sp[order(x.sp@metaData[,"sample"])]; #IMPORTANT

Step 6. Determine significant components

接下來，我們將確定特征向量（eigen-vectors）的數目，以便用于后續的分析。在下面的例子中，我們選擇前15個特征向量。

> plotDimReductPW(
    obj=x.sp, 
    eigs.dims=1:50,
    point.size=0.3,
    point.color="grey",
    point.shape=19,
    point.alpha=0.6,
    down.sample=5000,
    pdf.file.name=NULL, 
    pdf.height=7, 
    pdf.width=7
  );

image.png

Step 7. Graph-based clustering

接下來，我們使用所選的特征向量成分，來構造K近鄰（KNN）聚類圖。其中，每個節點是一個細胞，根據歐氏距離來確定每個細胞的k個近鄰細胞。

> x.sp = runKNN(
    obj=x.sp,
    eigs.dims=1:20,
    k=15
  );

> library(leiden);
> x.sp=runCluster(
    obj=x.sp,
    tmp.folder=tempdir(),
    louvain.lib="leiden",
    seed.use=10,
    resolution=0.7
  );

Step 8. Visualization

SnapATAC可以使用tSNE(FI-tsne)和UMAP方法對降維聚類后的細胞進行可視化的展示和探索。在本示例中，我們使用UMAP方法進行展示。

> library(umap);
> x.sp = runViz(
    obj=x.sp, 
    tmp.folder=tempdir(),
    dims=2,
    eigs.dims=1:20, 
    method="umap",
    seed.use=10
  );

> par(mfrow = c(2, 2));
> plotViz(
    obj= x.sp,
    method="umap", 
    main="Cluster",
    point.color=x.sp@cluster, 
    point.size=0.2, 
    point.shape=19, 
    text.add=TRUE,
    text.size=1,
    text.color="black",
    down.sample=10000,
    legend.add=FALSE
  );
> plotFeatureSingle(
    obj=x.sp,
    feature.value=x.sp@metaData[,"logUMI"],
    method="umap", 
    main="Read Depth",
    point.size=0.2, 
    point.shape=19, 
    down.sample=10000,
    quantiles=c(0.01, 0.99)
  );
> plotViz(
    obj= x.sp,
    method="umap", 
    main="Sample",
    point.size=0.2, 
    point.shape=19, 
    point.color=x.sp@sample, 
    text.add=FALSE,
    text.size=1.5,
    text.color="black",
    down.sample=10000,
    legend.add=TRUE
    );
> plotViz(
    obj= x.sp,
    method="umap", 
    main="Landmark",
    point.size=0.2, 
    point.shape=19, 
    point.color=x.sp@metaData[,"landmark"], 
    text.add=FALSE,
    text.size=1.5,
    text.color="black",
    down.sample=10000,
    legend.add=TRUE
  );

image.png

Step 9. scRNA-seq based annotation

在本示例中，我們將使用相應的scRNA-seq數據集來對scATAC-seq數據的細胞類群進行注釋。我們可以通過(https://www.dropbox.com/s/3f3p5nxrn5b3y4y/pbmc_10k_v3.rds?dl=1)地址下載所需的10X PBMC scRNA-seq(pbmc_10k_v3.rds)的Seurat對象。

> library(Seurat);
> pbmc.rna = readRDS("pbmc_10k_v3.rds");
> pbmc.rna$tech = "rna";
> variable.genes = VariableFeatures(object = pbmc.rna);

> genes.df = read.table("gencode.v19.annotation.gene.bed");
> genes.gr = GRanges(genes.df[,1], IRanges(genes.df[,2], genes.df[,3]), name=genes.df[,4]);
> genes.sel.gr = genes.gr[which(genes.gr$name %in% variable.genes)];

## reload the bmat, this is optional but highly recommanded
> x.sp = addBmatToSnap(x.sp);
> x.sp = createGmatFromMat(
    obj=x.sp, 
    input.mat="bmat",
    genes=genes.sel.gr,
    do.par=TRUE,
    num.cores=10
  );

接下來，我們將snap對象轉換為Seurat對象，用于后續的數據整合。

# 使用snapToSeurat函數將snap對象轉換為Seurat對象
> pbmc.atac <- snapToSeurat(
    obj=x.sp, 
    eigs.dims=1:20, 
    norm=TRUE,
    scale=TRUE
  );

# 識別整合的Anchors信息
> transfer.anchors <- FindTransferAnchors(
    reference = pbmc.rna, 
    query = pbmc.atac, 
    features = variable.genes, 
    reference.assay = "RNA", 
    query.assay = "ACTIVITY", 
    reduction = "cca"
  );

# 使用TransferData函數進行數據映射整合
> celltype.predictions <- TransferData(
    anchorset = transfer.anchors, 
    refdata = pbmc.rna$celltype,
    weight.reduction = pbmc.atac[["SnapATAC"]],
    dims = 1:20
  );
> x.sp@metaData$predicted.id = celltype.predictions$predicted.id;
> x.sp@metaData$predict.max.score = apply(celltype.predictions[,-1], 1, max);
> x.sp@cluster = as.factor(x.sp@metaData$predicted.id);

Step 10. Create psudo multiomics cells

現在，snap對象x.sp中包含了的每個細胞的染色質可及性@bmat和基因表達@gmat的信息。

> refdata <- GetAssayData(
    object = pbmc.rna, 
    assay = "RNA", 
    slot = "data"
  );
> imputation <- TransferData(
    anchorset = transfer.anchors, 
    refdata = refdata, 
    weight.reduction = pbmc.atac[["SnapATAC"]], 
    dims = 1:20
  );
> x.sp@gmat = t(imputation@data);
> rm(imputation); # free memory
> rm(refdata);    # free memory
> rm(pbmc.rna);   # free memory
> rm(pbmc.atac); # free memory

Step 11. Remove cells of low prediction score

> hist(
    x.sp@metaData$predict.max.score, 
    xlab="prediction score", 
    col="lightblue", 
    xlim=c(0, 1),
    main="PBMC 10X"
  );
> abline(v=0.5, col="red", lwd=2, lty=2);

> table(x.sp@metaData$predict.max.score > 0.5);
## FALSE  TRUE
##  331 13103

> x.sp = x.sp[x.sp@metaData$predict.max.score > 0.5,];
> x.sp
## number of barcodes: 13045
## number of bins: 627478
## number of genes: 19089
## number of peaks: 0

> plotViz(
    obj=x.sp,
    method="umap", 
    main="PBMC 10X",
    point.color=x.sp@metaData[,"predicted.id"], 
    point.size=0.5, 
    point.shape=19, 
    text.add=TRUE,
    text.size=1,
    text.color="black",
    down.sample=10000,
    legend.add=FALSE
  );

image.png

Step 12. Gene expression projected onto UMAP

接下來，我們使用UMAP方法對一些marker基因的表達水平進行可視化展示。

> marker.genes = c(
    "IL32", "LTB", "CD3D",
    "IL7R", "LDHB", "FCGR3A", 
    "CD68", "MS4A1", "GNLY", 
    "CD3E", "CD14", "CD14", 
    "FCGR3A", "LYZ", "PPBP", 
    "CD8A", "PPBP", "CST3", 
    "NKG7", "MS4A7", "MS4A1", 
    "CD8A"
  );

> par(mfrow = c(3, 3));
> for(i in 1:9){
    j = which(colnames(x.sp@gmat) == marker.genes[i])
    plotFeatureSingle(
        obj=x.sp,
        feature.value=x.sp@gmat[,j],
        method="umap", 
        main=marker.genes[i],
        point.size=0.1, 
        point.shape=19, 
        down.sample=10000,
        quantiles=c(0.01, 0.99)
 )};

image.png

Step 13. Identify peaks

接下來，我們將每個cluster群中的reads進行聚合，創建用于peak calling和可視化的集成track。執行完peak calling后，將生成一個.narrowPeak的文件，該文件中包含了識別出的peaks信息，和一個.bedgraph的文件，可以用于可視化展示。為了獲得最robust的結果，我們不建議對細胞數目小于100的cluster群執行此步驟。

> system("which snaptools");
/home/r3fang/anaconda2/bin/snaptools
> system("which macs2");
/home/r3fang/anaconda2/bin/macs2

# 使用runMACS函數調用macs2進行peak calling
> peaks = runMACS(
    obj=x.sp[which(x.sp@cluster=="CD4 Naive"),], 
    output.prefix="PBMC.CD4_Naive",
    path.to.snaptools="/home/r3fang/anaconda2/bin/snaptools",
    path.to.macs="/home/r3fang/anaconda2/bin/macs2",
    gsize="hs", # mm, hs, etc
    buffer.size=500, 
    num.cores=10,
    macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR",
    tmp.folder=tempdir()
    );

為了對所有的cluster群執行peak calling，我們提供了一個簡便的腳本來實現該步驟。

# call peaks for all cluster with more than 100 cells
# 選出那些細胞數目大于100的cluster群
> clusters.sel = names(table(x.sp@cluster))[which(table(x.sp@cluster) > 100)];

# 批量對不同的cluster進行peak calling
> peaks.ls = mclapply(seq(clusters.sel), function(i){
    print(clusters.sel[i]);
    peaks = runMACS(
        obj=x.sp[which(x.sp@cluster==clusters.sel[i]),], 
        output.prefix=paste0("PBMC.", gsub(" ", "_", clusters.sel)[i]),
        path.to.snaptools="/home/r3fang/anaconda2/bin/snaptools",
        path.to.macs="/home/r3fang/anaconda2/bin/macs2",
        gsize="hs", # mm, hs, etc
        buffer.size=500, 
        num.cores=1,
        macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR",
        tmp.folder=tempdir()
   );
    peaks
    }, mc.cores=5);

> peaks.names = system("ls | grep narrowPeak", intern=TRUE);

> peak.gr.ls = lapply(peaks.names, function(x){
    peak.df = read.table(x)
    GRanges(peak.df[,1], IRanges(peak.df[,2], peak.df[,3]))
  })

# 合并peak信息
> peak.gr = reduce(Reduce(c, peak.gr.ls));

我們可以將每個cluster群生成的bdg文件導入到IGV或其他基因組瀏覽器(如UW genome browser)進行可視化的展示，下面是來自UW genome browser的FOXJ2基因及其側翼區域的screenshot。

image.png

Step 14. Create a cell-by-peak matrix

接下來，我們使用合并后的peaks作為參考，來創建一個cell-by-peak計數矩陣，并將其添加到snap對象中。
首先，我們將合并后的peaks信息寫入到peaks.combined.bed文件中

> peaks.df = as.data.frame(peak.gr)[,1:3];
> write.table(peaks.df,file = "peaks.combined.bed",append=FALSE,
        quote= FALSE,sep="\t", eol = "\n", na = "NA", dec = ".", 
        row.names = FALSE, col.names = FALSE, qmethod = c("escape", "double"),
        fileEncoding = "")

接下來，我們創建一個cell-by-peak計數矩陣，并將其添加到snap對象中。這一步可能需要一段時間。

$ snaptools snap-add-pmat \
    --snap-file atac_pbmc_10k_nextgem.snap \
    --peak-file peaks.combined.bed &
$ snaptools snap-add-pmat \
    --snap-file atac_pbmc_5k_nextgem.snap \
    --peak-file peaks.combined.bed

然后，我們將計算好的cell-by-peak計數矩陣添加到已有的snap對象中。

> x.sp = addPmatToSnap(x.sp);

Step 15. Identify differentially accessible peaks

對于一組給定的細胞Ci，我們首先在diffusion component空間中尋找鄰近的細胞Cj (|Ci|=|Cj|)作為“background”細胞進行比較。如果Ci細胞的數目占總細胞數目的一半以上，則使用剩余的細胞作為local background。接下來，我們將Ci和Cj細胞進行聚合，來創建兩個raw-count向量，即Vci和Vcj。然后，我們使用R包edgeR (v3.18.1)的精確檢驗（exact test）來對Vci和Vcj進行差異分析，其中，對于小鼠設置BCV=0.1，而人類設置BCV= 0.4。最后，再使用Benjamini-Hochberg多重檢驗校正方法，來調整p-value值為錯誤發現率(FDR)值。對于FDR值小于0.05的peaks，被選為顯著的DARs。

我們發現這種方法也存在一定的局限性，特別是對于那些細胞數目較少的cluster群，統計的有效性可能不夠強大。對于那些未能識別出顯著差異peaks的cluster群，我們將根據富集p-value值對元素進行排序，并挑選出最具代表性的2,000個peaks進行后續的motif分析。

> DARs = findDAR(
    obj=x.sp,
    input.mat="pmat",
    cluster.pos="CD14+ Monocytes",
    cluster.neg.method="knn",
    test.method="exactTest",
    bcv=0.4, #0.4 for human, 0.1 for mouse
    seed.use=10
  );
> DARs$FDR = p.adjust(DARs$PValue, method="BH");
> idy = which(DARs$FDR < 5e-2 & DARs$logFC > 0);

> par(mfrow = c(1, 2));
> plot(DARs$logCPM, DARs$logFC, 
    pch=19, cex=0.1, col="grey", 
    ylab="logFC", xlab="logCPM",
    main="CD14+ Monocytes"
  );
> points(DARs$logCPM[idy], 
    DARs$logFC[idy], 
    pch=19, 
    cex=0.5, 
    col="red"
  );
> abline(h = 0, lwd=1, lty=2);

> covs = Matrix::rowSums(x.sp@pmat);
> vals = Matrix::rowSums(x.sp@pmat[,idy]) / covs;
> vals.zscore = (vals - mean(vals)) / sd(vals);

> plotFeatureSingle(
    obj=x.sp,
    feature.value=vals.zscore,
    method="umap", 
    main="CD14+ Monocytes",
    point.size=0.1, 
    point.shape=19, 
    down.sample=5000,
    quantiles=c(0.01, 0.99)
  );

Step 16. Motif variability analysis

SnapATAC可以調用chromVAR(Schep et al)程序進行motif可變性分析。

# 加載所需的R包
> library(chromVAR);
> library(motifmatchr);
> library(SummarizedExperiment);
> library(BSgenome.Hsapiens.UCSC.hg19);

> x.sp = makeBinary(x.sp, "pmat");

# 使用runChromVAR函數調用ChromVAR進行motif可變性分析
> x.sp@mmat = runChromVAR(
    obj=x.sp,
    input.mat="pmat",
    genome=BSgenome.Hsapiens.UCSC.hg19,
    min.count=10,
    species="Homo sapiens"
  );
> x.sp;
## number of barcodes: 13103
## number of bins: 627478
## number of genes: 19089
## number of peaks: 157750
## number of motifs: 271

> motif_i = "MA0071.1_RORA";
> dat = data.frame(x=x.sp@metaData$predicted.id, y=x.sp@mmat[,motif_i]);
> p <- ggplot(dat, aes(x=x, y=y, fill=x)) + 
    theme_classic() +
    geom_violin() + 
    xlab("cluster") +
    ylab("motif enrichment") + 
    ggtitle("MA0071.1_RORA") +
    theme(
          plot.margin = margin(5,1,5,1, "cm"),
          axis.text.x = element_text(angle = 90, hjust = 1),
          axis.ticks.x=element_blank(),
          legend.position = "none"
   );

image.png

Step 17. De novo motif discovery

SnapATAC還可以調用homer用于對識別出的差異可及性區域(DARs)進行motif的識別和富集分析.

# 查看findMotifsGenome.pl程序安裝的路徑
> system("which findMotifsGenome.pl");
/projects/ps-renlab/r3fang/public_html/softwares/homer/bin/findMotifsGenome.pl

# 使用findDAR函數鑒定DAR區域
> DARs = findDAR(
    obj=x.sp,
    input.mat="pmat",
    cluster.pos="Double negative T cell",
    cluster.neg.method="knn",
    test.method="exactTest",
    bcv=0.4, #0.4 for human, 0.1 for mouse
    seed.use=10
  );
> DARs$FDR = p.adjust(DARs$PValue, method="BH");
> idy = which(DARs$FDR < 5e-2 & DARs$logFC > 0);

# 使用runHomer函數調用homer進行de novo motif discovery
> motifs = runHomer(
    x.sp[,idy,"pmat"], 
    mat = "pmat",
    path.to.homer = "/projects/ps-renlab/r3fang/public_html/softwares/homer/bin/findMotifsGenome.pl",
    result.dir = "./homer/DoubleNegativeTcell",
    num.cores=5,
    genome = 'hg19',
    motif.length = 10,
    scan.size = 300,
    optimize.count = 2,
    background = 'automatic',
    local.background = FALSE,
    only.known = TRUE,
    only.denovo = FALSE,
    fdr.num = 5,
    cache = 100,
    overwrite = TRUE,
    keep.minimal = FALSE
    );

image.png

Step 18. Predict gene-enhancer pairs

最后，我們還可以使用一種“pseudo”細胞的方法，根據單個細胞中基因的表達和遠端調控元件的染色質可及性的關系，將遠端調控元件連接到目標基因上。對于給定的marker基因，我們首先識別出marker基因側翼區域內的peak。對于每個側翼的peak，我們使用基因表達作為輸入變量進行邏輯回歸來預測binarized的染色質狀態。產生的結果估計了染色質可及性與基因表達之間聯系的重要性。

> TSS.loci = GRanges("chr12", IRanges(8219067, 8219068));

> pairs = predictGenePeakPair(
    x.sp, 
    input.mat="pmat",
    gene.name="C3AR1", 
    gene.loci=resize(TSS.loci, width=500000, fix="center"),
    do.par=FALSE
    );

# convert the pair to genome browser arc plot format
> pairs.df = as.data.frame(pairs);
> pairs.df = data.frame(
    chr1=pairs.df[,"seqnames"],
    start1=pairs.df[,"start"],
    end1=pairs.df[,"end"],
    chr2="chr2",
    start2=8219067,
    end2=8219068,
    Pval=pairs.df[,"logPval"]
    );
> head(pairs.df)
##    chr1  start1    end1 chr2  start2    end2       Pval
## 1 chr12 7984734 7985229 chr2 8219067 8219068 14.6075918
## 2 chr12 7987561 7988085 chr2 8219067 8219068  5.6718381
## 3 chr12 7989776 7990567 chr2 8219067 8219068 24.2564608
## 4 chr12 7996454 7996667 chr2 8219067 8219068  0.6411017
## 5 chr12 8000059 8000667 chr2 8219067 8219068  2.0324922
## 6 chr12 8012404 8013040 chr2 8219067 8219068  0.0000000

image.png

參考來源：https://gitee.com/booew/SnapATAC/blob/master/examples/10X_PBMC_15K/README.md

三个男躁一个女,国精产品一区一手机的秘密,麦子交换系列最经典十句话,欧美国产综合欧美视频

使用SnapATAC分析單細胞ATAC-seq數據（三）：Integrative Analysis of PBMC scATAC-seq and scRNA-seq

使用SnapATAC分析單細胞ATAC-seq數據（三）：Integrative Analysis of PBMC scATAC-seq and scRNA-seq

簡介

分析步驟

Step 0. Data Download

Step 1. Barcode selection

Step 2. Add cell-by-bin matrix

Step 3. Matrix binarization

Step 4. Bin filtering

Step 5. Dimentionality reduction

Step 6. Determine significant components

Step 7. Graph-based clustering

Step 8. Visualization

Step 9. scRNA-seq based annotation

Step 10. Create psudo multiomics cells

Step 11. Remove cells of low prediction score

Step 12. Gene expression projected onto UMAP

Step 13. Identify peaks

Step 14. Create a cell-by-peak matrix

Step 15. Identify differentially accessible peaks

Step 16. Motif variability analysis

Step 17. De novo motif discovery

Step 18. Predict gene-enhancer pairs

三个男躁一个女,国精产品一区一手机的秘密,麦子交换系列最经典十句话,欧美 国产 综合 欧美 视频

使用SnapATAC分析單細胞ATAC-seq數據（三）：Integrative Analysis of PBMC scATAC-seq and scRNA-seq

簡介

分析步驟

Step 0. Data Download

Step 1. Barcode selection

Step 2. Add cell-by-bin matrix

Step 3. Matrix binarization

Step 4. Bin filtering

Step 5. Dimentionality reduction

Step 6. Determine significant components

Step 7. Graph-based clustering

Step 8. Visualization

Step 9. scRNA-seq based annotation

Step 10. Create psudo multiomics cells

Step 11. Remove cells of low prediction score

Step 12. Gene expression projected onto UMAP

Step 13. Identify peaks

Step 14. Create a cell-by-peak matrix

Step 15. Identify differentially accessible peaks

Step 16. Motif variability analysis

Step 17. De novo motif discovery

Step 18. Predict gene-enhancer pairs

三个男躁一个女,国精产品一区一手机的秘密,麦子交换系列最经典十句话,欧美国产综合欧美视频