參考:R document
Seurat識(shí)別細(xì)胞類(lèi)群的原理FindNeighbors和FindClusters
Jaccard系數(shù)_百度百科
Annoy解析
https://blog.csdn.net/qq_37696858/article/details/88143156
(Shared) Nearest-neighbor graph construction
- Description
Computes the k.param nearest neighbors for a given dataset. Can also optionally (via compute.SNN), construct a shared nearest neighbor graph by calculating the neighborhood overlap (Jaccard index) between every cell and its k.param nearest neighbors.
首先計(jì)算每個(gè)細(xì)胞的KNN,也就是計(jì)算每個(gè)細(xì)胞之間的相互距離,依據(jù)細(xì)胞之間鄰居的overlap來(lái)構(gòu)建snn graph。
計(jì)算給定數(shù)據(jù)集的k.param最近鄰。也可以選擇(通過(guò)compute.SNN),通過(guò)計(jì)算每個(gè)細(xì)胞最近鄰之間的鄰域重疊(Jaccard索引)和其鄰近的k.param來(lái)構(gòu)造SNN。 - Usage
FindNeighbors(object, ...)
Default S3 method:
FindNeighbors(
object,
query = NULL,
distance.matrix = FALSE,
k.param = 20,
return.neighbor = FALSE,
compute.SNN = !return.neighbor,
prune.SNN = 1/15,
nn.method = "annoy",
n.trees = 50,
annoy.metric = "euclidean",
nn.eps = 0,
verbose = TRUE,
force.recalc = FALSE,
l2.norm = FALSE,
cache.index = FALSE,
index = NULL,
...
)
S3 method for class 'Assay'
FindNeighbors(
object,
features = NULL,
k.param = 20,
return.neighbor = FALSE,
compute.SNN = !return.neighbor,
prune.SNN = 1/15,
nn.method = "annoy",
n.trees = 50,
annoy.metric = "euclidean",
nn.eps = 0,
verbose = TRUE,
force.recalc = FALSE,
l2.norm = FALSE,
cache.index = FALSE,
...
)
S3 method for class 'dist'
FindNeighbors(
object,
k.param = 20,
return.neighbor = FALSE,
compute.SNN = !return.neighbor,
prune.SNN = 1/15,
nn.method = "annoy",
n.trees = 50,
annoy.metric = "euclidean",
nn.eps = 0,
verbose = TRUE,
force.recalc = FALSE,
l2.norm = FALSE,
cache.index = FALSE,
...
)
S3 method for class 'Seurat'
FindNeighbors(
object,
reduction = "pca",
dims = 1:10,
assay = NULL,
features = NULL,
k.param = 20,
return.neighbor = FALSE,
compute.SNN = !return.neighbor,
prune.SNN = 1/15,
nn.method = "annoy",
n.trees = 50,
annoy.metric = "euclidean",
nn.eps = 0,
verbose = TRUE,
force.recalc = FALSE,
do.plot = FALSE,
graph.name = NULL,
l2.norm = FALSE,
cache.index = FALSE,
...
)
- Arguments
object
An object
...
Arguments passed to other methods
query
Matrix of data to query against object. If missing, defaults to object.
詢(xún)問(wèn)
根據(jù)對(duì)象查詢(xún)的數(shù)據(jù)矩陣。如果缺少,默認(rèn)為對(duì)象。
distance.matrix
Boolean value of whether the provided matrix is a distance matrix; note, for objects of class dist, this parameter will be set automatically
距離矩陣
提供的矩陣是否為距離矩陣的布爾值;注意,對(duì)于dist類(lèi)的對(duì)象,此參數(shù)設(shè)置為 automatically
k.param
Defines k for the k-nearest neighbor algorithm
這個(gè)參數(shù)用于設(shè)置 KNN 算法中最近鄰的個(gè)數(shù),默認(rèn)20
return.neighbor
Return result as Neighbor object. Not used with distance matrix input.
返回的鄰居
將結(jié)果作為鄰居對(duì)象返回。不適用于距離矩陣輸入。
compute.SNN
also compute the shared nearest neighbor graph
計(jì)算共享鄰居的數(shù)量,一般不設(shè)置。
prune.SNN
Sets the cutoff for acceptable Jaccard index when computing the neighborhood overlap for the SNN construction. Any edges with values less than or equal to this will be set to 0 and removed from the SNN graph. Essentially sets the stringency of pruning (0 — no pruning, 1 — prune everything).
計(jì)算SNN構(gòu)造的領(lǐng)域重疊時(shí),設(shè)置可接受的Jaccard指數(shù)的截止值。任何小于或者等于此值的任何邊將被設(shè)置為0并從SNN圖中刪除。本質(zhì)上設(shè)置修剪的嚴(yán)格性(0-不修剪,1-修剪所有內(nèi)容)。
nn.method
Method for nearest neighbor finding. Options include: rann, annoy
這個(gè)參數(shù)提供了如何判斷鄰居的方法,提供的可選是rann, annoy
n.trees
More trees gives higher precision when using annoy approximate nearest neighbor search
樹(shù)的數(shù)量
當(dāng)使用annoy的近似最近鄰搜索時(shí),樹(shù)越多,給出的精度越高。
annoy.metric
Distance metric for annoy. Options include: euclidean, cosine, manhattan, and hamming
annoy的距離,選項(xiàng)包括:歐幾里德、余弦、曼哈頓和漢明
nn.eps
Error bound when performing nearest neighbor seach using RANN; default of 0.0 implies exact nearest neighbor search
使用RANN執(zhí)行最近鄰搜索時(shí)的錯(cuò)誤界限;默認(rèn)值0.0表示精確最近鄰搜索
verbose
Whether or not to print output to the console
是否將輸出打印到控制臺(tái)
force.recalc
Force recalculation of (S)NN.
SNN強(qiáng)制重新計(jì)算,一般不設(shè)置
l2.norm
Take L2Norm of the data
L2正則化
cache.index
Include cached index in returned Neighbor object (only relevant if return.neighbor = TRUE)
在返回的鄰居對(duì)象中包含緩存的索引(僅當(dāng)return.neighbor = TRUE時(shí)相關(guān))
index
Precomputed index. Useful if querying new data against existing index to avoid recomputing.
索引
預(yù)計(jì)算索引。如果根據(jù)現(xiàn)有索引查詢(xún)新數(shù)據(jù)以避免重新計(jì)算,則非常有用。
features
Features to use as input for building the (S)NN; used only when dims is NULL
特征
用作構(gòu)建SNN神經(jīng)網(wǎng)絡(luò)輸入的特征;僅當(dāng)dims為空時(shí)使用
reduction
Reduction to use as input for building the (S)NN
輸入的降維方法,用于構(gòu)建SNN神經(jīng)網(wǎng)絡(luò)
dims
Dimensions of reduction to use as input
輸入的降維的維度
assay
Assay to use in construction of (S)NN; used only when dims is NULL
用于構(gòu)造SNN神經(jīng)網(wǎng)絡(luò)的分析;僅當(dāng)dims為空時(shí)使用
do.plot
Plot SNN graph on tSNE coordinates
在tSNE坐標(biāo)上繪制SNN圖
graph.name
Optional naming parameter for stored (S)NN graph (or Neighbor object, if return.neighbor = TRUE). Default is assay.name_(s)nn. To store both the neighbor graph and the shared nearest neighbor (SNN) graph, you must supply a vector containing two names to the graph.name parameter. The first element in the vector will be used to store the nearest neighbor (NN) graph, and the second element used to store the SNN graph. If only one name is supplied, only the NN graph is stored.
可選命名存儲(chǔ)的神經(jīng)網(wǎng)絡(luò)圖的參數(shù)(或鄰居對(duì)象,如果返回的鄰居=真)。默認(rèn)為assay.name_(s)nn。要存儲(chǔ)鄰居圖和共享最近鄰居(SNN)圖,必須向graph.name參數(shù)提供一個(gè)包含兩個(gè)名稱(chēng)的向量。向量中的第一個(gè)元素將用于存儲(chǔ)最近鄰圖,第二個(gè)元素用于存儲(chǔ)SNN圖。如果只提供一個(gè)名稱(chēng),則只存儲(chǔ)神經(jīng)網(wǎng)絡(luò)圖。
- Value
This function can either return a Neighbor object with the KNN information or a list of Graph objects with the KNN and SNN depending on the settings of return.neighbor and compute.SNN. When running on a Seurat object, this returns the Seurat object with the Graphs or Neighbor objects stored in their respective slots. Names of the Graph or Neighbor object can be found with Graphs or Neighbors.
值
根據(jù)return.neighbor和compute.SNN的設(shè)置,該函數(shù)可以返回帶有KNN信息的neighbor對(duì)象,也可以返回帶有KNN和SNN的Graph對(duì)象列表。當(dāng)在Seurat 對(duì)象上運(yùn)行時(shí),這將返回帶有存儲(chǔ)在各自slots中的Graph或Neighbor對(duì)象的Seurat對(duì)象。圖形或鄰居對(duì)象的名稱(chēng)可以在圖形或鄰居中找到。 - Examples
data("pbmc_small")
pbmc_small
Compute an SNN on the gene expression level
pbmc_small <- FindNeighbors(pbmc_small, features = VariableFeatures(object = pbmc_small))
More commonly, we build the SNN on a dimensionally reduced form of the data
such as the first 10 principle components.
pbmc_small <- FindNeighbors(pbmc_small, reduction = "pca", dims = 1:10)
1.KNN是什么?
所謂K最近鄰,就是K個(gè)最近的鄰居的意思,說(shuō)的是每個(gè)樣本都可以用它最接近的K個(gè)鄰近值來(lái)代表。近鄰算法就是將數(shù)據(jù)集合中每一個(gè)記錄進(jìn)行分類(lèi)的方法。
2.jaccard index是什么?
雅卡爾指數(shù) (Jaccard index)或者稱(chēng)為交并比、雅卡爾相似系數(shù)
可以用于比較樣本集的相似性與多樣性。其定義為兩個(gè)集合交集大小與并集大小之間的比例:
定義
給定兩個(gè)集合A,B,Jaccard 系數(shù)定義為A與B交集的大小與A與B并集的大小的比值,定義如下:
當(dāng)A和B是空集時(shí),定義jaccard index為1。
雅卡爾距離(Jaccard distance)
則用于量度樣本集之間的不相似度,其定義為1減去雅卡爾系數(shù)。Jaccard 距離越大,樣本相似度越低。公式定義如下:其中對(duì)參差(symmetric difference)
性質(zhì)
-
相似性
非對(duì)稱(chēng)二元屬性的相似性
在數(shù)據(jù)挖掘領(lǐng)域,常常需要比較兩個(gè)具有布爾值屬性的對(duì)象之間的距離,Jaccard距離就是常用的一種方法。給定兩個(gè)比較對(duì)象A,B。A, B 均有n個(gè)二元屬性,即每個(gè)屬性取值為{0,1}。定義如下4個(gè)統(tǒng)計(jì)量:
如圖1數(shù)示:
顯然有
Jaccard 系數(shù):
Jaccard距離:
對(duì)于非對(duì)稱(chēng)二元屬性而言(比如說(shuō)對(duì)于患癌癥和不患癌癥的屬性而言,不患癌癥是0,患癌癥是1,那么0的數(shù)量遠(yuǎn)遠(yuǎn)大于1,但是我們卻更關(guān)注1的數(shù)量):
M11 - A和B中都是1;
M01 - A中是0,B中是1;
M10 - A中是1,B中是0;
M00 - 兩者都是0.
3.如何鄰居判定?annoy、rann
-
Annoy是高維空間求近似最近鄰的一個(gè)開(kāi)源模塊。
Annoy構(gòu)建一棵二叉樹(shù),查詢(xún)時(shí)間為O(logn)。
Annoy通過(guò)隨機(jī)挑選兩個(gè)點(diǎn),并使用垂直于這個(gè)點(diǎn)的等距離超平面將集合劃分為兩部分。
如圖所示,圖中灰色線(xiàn)是連接兩個(gè)點(diǎn),超平面是加粗的黑線(xiàn)。按照這個(gè)方法在每個(gè)子集上迭代進(jìn)行劃分。
依此類(lèi)推,直到每個(gè)集合最多剩余k個(gè)點(diǎn),下圖是一個(gè)k = 10 的情況。
相應(yīng)的完整二叉樹(shù)結(jié)構(gòu):
隨機(jī)投影森林。
一個(gè)思想依據(jù)是:在原空間中相鄰的點(diǎn),在樹(shù)結(jié)構(gòu)上也表現(xiàn)出相互靠近的特點(diǎn),也就是說(shuō),如果兩個(gè)點(diǎn)在空間上相互靠近,那么他們很可能被樹(shù)結(jié)構(gòu)劃分到一起。
如果要在空間中查找臨近點(diǎn),我們可以在這個(gè)二叉樹(shù)中搜索。上圖中每個(gè)節(jié)點(diǎn)用超平面來(lái)定義,所以我們可以計(jì)算出該節(jié)點(diǎn)往哪個(gè)方向遍歷,搜索時(shí)間 log n
技巧1:使用優(yōu)先隊(duì)列
如果一個(gè)劃分的兩邊“靠得足夠近”(量化方式在后面介紹),我們就兩邊都遍歷。這樣就不只是遍歷一個(gè)節(jié)點(diǎn)的一邊,我們將遍歷更多的點(diǎn)
我們可以設(shè)置一個(gè)閾值,用來(lái)表示是否愿意搜索劃分“錯(cuò)”的一遍。如果設(shè)置為0,我們將總是遍歷“對(duì)”的一片。但是如果設(shè)置成0.5,就按照上面的搜索路徑。
這個(gè)技巧實(shí)際上是利用優(yōu)先級(jí)隊(duì)列,依據(jù)兩邊的最大距離。好處是我們能夠設(shè)置比0大的閾值,逐漸增加搜索范圍。
技巧2:構(gòu)建一個(gè)森林
我們能夠用一個(gè)優(yōu)先級(jí)隊(duì)列,同時(shí)搜索所有的樹(shù)。這樣有另外一個(gè)好處,搜索會(huì)聚焦到那些與已知點(diǎn)靠得最近的那些樹(shù)——能夠把距離最遠(yuǎn)的空間劃分出去
每棵樹(shù)都包含所有的點(diǎn),所以當(dāng)我們搜索多棵樹(shù)的時(shí)候,將找到多棵樹(shù)上的多個(gè)點(diǎn)。如果我們把所有的搜索結(jié)果的葉子節(jié)點(diǎn)都合在一起,那么得到的最近鄰就非常符合要求。依照上述方法,我們找到一個(gè)近鄰的集合,接下來(lái)就是計(jì)算所有的距離和對(duì)這些點(diǎn)進(jìn)行排序,找到最近的k個(gè)點(diǎn)。
很明顯,我們會(huì)丟掉一些最近的點(diǎn),這也是為什么叫近似最近鄰的原因。
Annoy在實(shí)際使用的時(shí)候,提供了一種機(jī)制可以調(diào)整(搜索k),你能夠根據(jù)它來(lái)權(quán)衡性能(時(shí)間)和準(zhǔn)確度(質(zhì)量)。
tips:
1.距離計(jì)算,采用歸一化的歐氏距離:vectors = sqrt(2-2*cos(u, v)) 即余弦距離
2.向量維度較小(<100),即使維度到達(dá)1000變現(xiàn)也不錯(cuò)
3.內(nèi)存占用小
4.索引創(chuàng)建與查找分離(特別是一旦樹(shù)已經(jīng)創(chuàng)建,就不能添加更多項(xiàng))
5.有兩個(gè)參數(shù)可以用來(lái)調(diào)節(jié)Annoy 樹(shù)的數(shù)量n_trees和搜索期間檢查的節(jié)點(diǎn)數(shù)量search_k
n_trees在構(gòu)建時(shí)提供,并影響構(gòu)建時(shí)間和索引大小。 較大的值將給出更準(zhǔn)確的結(jié)果,但更大的索引。
search_k在運(yùn)行時(shí)提供,并影響搜索性能。 較大的值將給出更準(zhǔn)確的結(jié)果,但將需要更長(zhǎng)的時(shí)間返回。
如果不提供search_k,它將默認(rèn)為n * n_trees,其中n是近似最近鄰的數(shù)目。 否則,search_k和n_tree大致是獨(dú)立的,即如果search_k保持不變,n_tree的值不會(huì)影響搜索時(shí)間,反之亦然。 基本上,建議在可用負(fù)載量的情況下盡可能大地設(shè)置n_trees,并且考慮到查詢(xún)的時(shí)間限制,建議將search_k設(shè)置為盡可能大。
from annoy import AnnoyIndex
import random
f = 40 #維度
t = AnnoyIndex(f) # Length of item vector that will be indexed
for i in xrange(1000):
v = [random.gauss(0, 1) for z in xrange(f)]
t.add_item(i, v) #添加向量
t.build(10) # 10 trees
t.save('test.ann')
u = AnnoyIndex(f)
u.load('test.ann') # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000)) # will find the 1000 nearest neighbors of the first(0) vec
seurat4的annoy的源代碼:
# Run annoy
#
# @param data Data to build the index with
# @param query A set of data to be queried against data
# @param metric Distance metric; can be one of "euclidean", "cosine", "manhattan",
# "hamming"
# @param n.trees More trees gives higher precision when querying
# @param k Number of neighbors
# @param search.k During the query it will inspect up to search_k nodes which
# gives you a run-time tradeoff between better accuracy and speed.
# @param include.distance Include the corresponding distances
# @param index optional index object, will be recomputed if not provided
#
AnnoyNN <- function(data,
query = data,
metric = "euclidean",
n.trees = 50,
k,
search.k = -1,
include.distance = TRUE,
index = NULL
) {
idx <- index %||% AnnoyBuildIndex(
data = data,
metric = metric,
n.trees = n.trees)
nn <- AnnoySearch(
index = idx,
query = query,
k = k,
search.k = search.k,
include.distance = include.distance)
nn$idx <- idx
nn$alg.info <- list(metric = metric, ndim = ncol(x = data))
return(nn)
}
# Build the annoy index
#
# @param data Data to build the index with
# @param metric Distance metric; can be one of "euclidean", "cosine", "manhattan",
# "hamming"
# @param n.trees More trees gives higher precision when querying
#
#' @importFrom RcppAnnoy AnnoyEuclidean AnnoyAngular AnnoyManhattan AnnoyHamming
#
AnnoyBuildIndex <- function(data, metric = "euclidean", n.trees = 50) {
f <- ncol(x = data)
a <- switch(
EXPR = metric,
"euclidean" = new(Class = RcppAnnoy::AnnoyEuclidean, f),
"cosine" = new(Class = RcppAnnoy::AnnoyAngular, f),
"manhattan" = new(Class = RcppAnnoy::AnnoyManhattan, f),
"hamming" = new(Class = RcppAnnoy::AnnoyHamming, f),
stop ("Invalid metric")
)
for (ii in seq(nrow(x = data))) {
a$addItem(ii - 1, data[ii, ])
}
a$build(n.trees)
return(a)
}
# Search an Annoy approximate nearest neighbor index
#
# @param Annoy index, built with AnnoyBuildIndex
# @param query A set of data to be queried against the index
# @param k Number of neighbors
# @param search.k During the query it will inspect up to search_k nodes which
# gives you a run-time tradeoff between better accuracy and speed.
# @param include.distance Include the corresponding distances in the result
#
# @return A list with 'nn.idx' (for each element in 'query', the index of the
# nearest k elements in the index) and 'nn.dists' (the distances of the nearest
# k elements)
#
#' @importFrom future plan
#' @importFrom future.apply future_lapply
#
AnnoySearch <- function(index, query, k, search.k = -1, include.distance = TRUE) {
n <- nrow(x = query)
idx <- matrix(nrow = n, ncol = k)
dist <- matrix(nrow = n, ncol = k)
convert <- methods::is(index, "Rcpp_AnnoyAngular")
if (!inherits(x = plan(), what = "multicore")) {
oplan <- plan(strategy = "sequential")
on.exit(plan(oplan), add = TRUE)
}
res <- future_lapply(X = 1:n, FUN = function(x) {
res <- index$getNNsByVectorList(query[x, ], k, search.k, include.distance)
# Convert from Angular to Cosine distance
if (convert) {
res$dist <- 0.5 * (res$dist * res$dist)
}
list(res$item + 1, res$distance)
})
for (i in 1:n) {
idx[i, ] <- res[[i]][[1]]
if (include.distance) {
dist[i, ] <- res[[i]][[2]]
}
}
return(list(nn.idx = idx, nn.dists = dist))
}
- rann包
KD樹(shù)法—大規(guī)模點(diǎn)集
這個(gè)包只有一個(gè)函數(shù)nn2。功能十分強(qiáng)大,支持很高的維數(shù)(文檔里說(shuō)20維以上可能不太好了,但還是可以試試),返回值包括了k級(jí)最鄰近點(diǎn)的序號(hào)和k級(jí)最鄰近點(diǎn)距離兩個(gè)內(nèi)容,非常實(shí)用。由于實(shí)際上調(diào)用的是c++庫(kù),所以速度非常快,我用了30w+點(diǎn)的點(diǎn)集,瞬間完成。
這包說(shuō)明詳見(jiàn)RANN package - RDocumentation
Finds the k nearest neighbours for every point in a given dataset in O(N log N) time using Arya and Mount's ANN library (v1.1.3). There is support for approximate as well as exact searches, fixed radius searches and bd as well as kd trees.
使用Arya和Mount的ANN庫(kù)(v1.1.3)在O(N log N)時(shí)間內(nèi)為給定數(shù)據(jù)集中的每個(gè)點(diǎn)查找k個(gè)最近的鄰居。支持近似和精確搜索,固定半徑搜索,bd和kd樹(shù)。
This package implements nearest neighbors for the Euclidean (L2) metric. For the Manhattan (L1) metric, install the RANN1 package.
這個(gè)包實(shí)現(xiàn)了歐幾里德(L2)度量的最近鄰。對(duì)于曼哈頓(L1)指標(biāo),安裝RANN1軟件包。
library(RANN)
p_co=st_coordinates(point_set)
#獲取點(diǎn)坐標(biāo)。由于這個(gè)包只返回歐式距離,
不是專(zhuān)門(mén)針對(duì)地理空間開(kāi)發(fā)的,
所以在計(jì)算前還是先把點(diǎn)坐標(biāo)轉(zhuǎn)換成投影坐標(biāo)系為宜。
d=nn2(p_co,p_co,k=2)
#k=2即為設(shè)定K級(jí)最鄰近點(diǎn),
因?yàn)檫@個(gè)包只支持計(jì)算兩個(gè)點(diǎn)集之間的最鄰近點(diǎn)關(guān)系,所以只能計(jì)算包含自身的2級(jí)最鄰近點(diǎn)。
該函數(shù)還有其他參數(shù)可以設(shè)定搜尋方法、查尋樹(shù)的類(lèi)型、搜索半徑等。
d_indice=d[[1]][,2]
#返回值包括兩個(gè)list元素,
第一個(gè)元素就是最鄰近點(diǎn)序號(hào)矩陣。
第二列就是我們要找的最鄰近點(diǎn)序號(hào)。
'RANN'提供一個(gè)快速的最近鄰搜索算法,能夠?qū)⑺惴ǖ臅r(shí)間復(fù)雜度縮短為O(n log n)。
L2正則化:
L1正則化就是在loss function后邊所加正則項(xiàng)為L(zhǎng)1范數(shù),加上L1范數(shù)容易得到稀疏解(0比較多)。L2正則化就是loss function后邊所加正則項(xiàng)為L(zhǎng)2范數(shù)的平方,加上L2正則相比于L1正則來(lái)說(shuō),得到的解比較平滑(不是稀疏),但是同樣能夠保證解中接近于0(但不是等于0,所以相對(duì)平滑)的維度比較多,降低模型的復(fù)雜度。