Adapted from the articles 初識聚類算法:K均值、凝聚層次聚類和DBSCAN ("A First Look at Clustering Algorithms: K-Means, Agglomerative Hierarchical Clustering, and DBSCAN") and 模糊聚類FCM算法 ("The Fuzzy Clustering FCM Algorithm").
The Purpose of Clustering
Clustering partitions data into a number of clusters so that similarity within each cluster is high and similarity between clusters is low; the stronger this separation, the better the clustering. It is used to extract information and regularities from data.
Clustering Concepts
- Hierarchical vs. partitional: when sub-clusters are allowed, the data is divided level by level and the final result is a tree. The parent-child relationships in the tree are the hierarchical relationships of the partition. Sub-clusters at the same level do not overlap, and every element belongs to some sub-cluster at each level.
- Exclusive, overlapping, and fuzzy: the core idea of fuzzy clustering is that no element belongs entirely to any single cluster; instead it belongs to every cluster with some degree of membership. For any given element, the membership degrees generally sum to 1.
- Complete vs. partial: complete clustering requires every data element to be assigned to some cluster, while partial clustering allows noise points that belong to no cluster.
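The membership idea can be made concrete with a toy sketch. Note the inverse-distance weighting here is made up purely for illustration; the actual FCM membership formula appears later in this article:

```python
# Hypothetical example: turn the distances from one element to three cluster
# centers into membership degrees that sum to 1 (larger distance -> smaller
# membership), via simple inverse-distance normalization.
distances = [1.0, 2.0, 4.0]                 # distances to clusters A, B, C
inverse = [1.0 / d for d in distances]
total = sum(inverse)
memberships = [v / total for v in inverse]
print(memberships)                          # degrees of belonging to A, B, C
print(sum(memberships))                     # sums to 1 for any element
```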
Types of Clusters
- Well-separated: the distance between any two elements in different clusters is greater than the distance between elements within a cluster. Such clusters appear visibly separated in a plot.
- Prototype-based: each element is closer to the center of the cluster it belongs to (the centroid of that cluster's elements) than to the center of any other cluster.
- Graph-based: nodes in a graph are the objects and edge weights are distances. The definition is analogous to the well-separated or prototype-based one, with edge weights taking the place of explicitly specified distances.
- Density-based: the most commonly used and most widely applicable notion of a cluster. The density of the elements determines the cluster layout. Density-based clusters are the usual choice when noise must be identified, or when cluster shapes are irregular.
Common Clustering Algorithms
- Basic k-means: the k-means algorithm. Its clusters are prototype-based. It is used when the number of clusters is known; it requires roughly circular cluster shapes and cannot separate out noise.
- Agglomerative hierarchical clustering: initially every point is its own cluster; the closest clusters are then merged repeatedly until the number of clusters meets the user's requirement.
- DBSCAN: a clustering method based on density and partitioning.
Basic Ideas of the Algorithms
(1) Basic k-means clustering (hard c-means, HCM)
The method is simple. First choose several initial cluster centers. Assign every element to the cluster whose center is nearest. Then recompute each cluster's center as the centroid of its elements. Repeat these two steps until the centroids no longer change appreciably; the clustering is then complete.
Which distance to use can be chosen according to the nature of the data or the requirements of the project; for a taxonomy of distances, see the Manhattan, diagonal, and Euclidean distances discussed in "A-star算法概述及其在游戲開發(fā)中的應用分析". The procedure effectively minimizes a global objective function: the sum of each element's distance to its nearest cluster center.
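The distance options mentioned above can be sketched as follows (function names are illustrative, not taken from the referenced article):

```python
import math

# Three common grid distances for 2-D points given as (x, y) tuples.
def manhattan(a, b):
    # 4-way moves on a grid
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def euclidean(a, b):
    # straight-line distance
    return math.hypot(a[0] - b[0], a[1] - b[1])

def chebyshev(a, b):
    # "diagonal" distance: 8-way moves on a grid
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

p, q = (0, 0), (3, 4)
print(manhattan(p, q), euclidean(p, q), chebyshev(p, q))  # 7 5.0 4
```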
The algorithm has the following characteristics:
- First, it does not necessarily find the global optimum: with poor initial cluster centers it may converge only to a local optimum. Researchers have addressed this with a preprocessing step that gives the initial centers good properties, making a near-optimal result far more likely; the resulting method is named k-means++.
- Second, it cannot suppress the influence of noise points on the clustering.
- Third, it requires cluster shapes that are roughly circular.
- It requires complete clustering: every element must be assigned to some cluster.
Python code
This code uses the k-means++ seeding scheme, which chooses the initial cluster centers in a principled way so that the subsequent iterations converge to a good (though not guaranteed globally optimal) solution.
import math
import random
import copy
import pylab

FLOAT_MAX = 1e100

class Point:
    __slots__ = ["x", "y", "group"]
    def __init__(self, x=0, y=0, group=0):
        self.x, self.y, self.group = x, y, group

def generatePoints(pointsNumber, radius):
    # Scatter points uniformly in angle and radius inside a disc.
    points = [Point() for _ in range(pointsNumber)]
    for point in points:
        r = random.random() * radius
        angle = random.random() * 2 * math.pi
        point.x = r * math.cos(angle)
        point.y = r * math.sin(angle)
    return points

def solveDistanceBetweenPoints(pointA, pointB):
    # Squared Euclidean distance; the square root is not needed for comparisons.
    return (pointA.x - pointB.x) ** 2 + (pointA.y - pointB.y) ** 2

def getNearestCenter(point, clusterCenterGroup):
    minIndex = point.group
    minDistance = FLOAT_MAX
    for index, center in enumerate(clusterCenterGroup):
        distance = solveDistanceBetweenPoints(point, center)
        if distance < minDistance:
            minDistance = distance
            minIndex = index
    return minIndex, minDistance

def kMeansPlusPlus(points, clusterCenterGroup):
    # k-means++ seeding: each subsequent center is sampled with probability
    # proportional to the squared distance to the nearest center chosen so far.
    clusterCenterGroup[0] = copy.copy(random.choice(points))
    distanceGroup = [0.0] * len(points)
    for index in range(1, len(clusterCenterGroup)):
        total = 0.0  # reset for every new center
        for i, point in enumerate(points):
            distanceGroup[i] = getNearestCenter(point, clusterCenterGroup[:index])[1]
            total += distanceGroup[i]
        total *= random.random()
        for i, distance in enumerate(distanceGroup):
            total -= distance
            if total < 0:
                clusterCenterGroup[index] = copy.copy(points[i])
                break
    for point in points:
        point.group = getNearestCenter(point, clusterCenterGroup)[0]

def kMeans(points, clusterCenterNumber):
    clusterCenterGroup = [Point() for _ in range(clusterCenterNumber)]
    kMeansPlusPlus(points, clusterCenterGroup)
    clusterCenterTrace = [[clusterCenter] for clusterCenter in clusterCenterGroup]
    tolerableError, currentError = 5.0, FLOAT_MAX
    while currentError >= tolerableError:
        countCenterNumber = [0] * clusterCenterNumber
        currentCenterGroup = [Point() for _ in range(clusterCenterNumber)]
        for point in points:
            currentCenterGroup[point.group].x += point.x
            currentCenterGroup[point.group].y += point.y
            countCenterNumber[point.group] += 1
        for index, center in enumerate(currentCenterGroup):
            # Assumes no cluster ever becomes empty.
            center.x /= countCenterNumber[index]
            center.y /= countCenterNumber[index]
        # Track how far each center moved; stop when the total movement is small.
        currentError = 0.0
        for index, singleTrace in enumerate(clusterCenterTrace):
            singleTrace.append(currentCenterGroup[index])
            currentError += solveDistanceBetweenPoints(singleTrace[-1], singleTrace[-2])
            clusterCenterGroup[index] = copy.copy(currentCenterGroup[index])
        for point in points:
            point.group = getNearestCenter(point, clusterCenterGroup)[0]
    return clusterCenterGroup, clusterCenterTrace

def showClusterAnalysisResults(points, clusterCenterTrace):
    colorStore = ['or', 'og', 'ob', 'oc', 'om', 'oy', 'ok']
    pylab.figure(figsize=(9, 9), dpi=80)
    for point in points:
        color = colorStore[point.group] if point.group < len(colorStore) else colorStore[-1]
        pylab.plot(point.x, point.y, color)
    for singleTrace in clusterCenterTrace:
        pylab.plot([center.x for center in singleTrace], [center.y for center in singleTrace], 'k')
    pylab.show()

def main():
    clusterCenterNumber = 5
    pointsNumber = 2000
    radius = 10
    points = generatePoints(pointsNumber, radius)
    _, clusterCenterTrace = kMeans(points, clusterCenterNumber)
    showClusterAnalysisResults(points, clusterCenterTrace)

main()
(1) Extra: Fuzzy c-means clustering (FCM)
Fuzzy c-means clustering, like hard-partition k-means, is a partition-based clustering method; FCM is the natural successor of HCM. Unlike k-means, each point in fuzzy c-means belongs to every cluster center vi with a different membership degree ui, while the clustering procedure itself is similar to k-means. (See the article 模糊聚類FCM算法 for details.)
Clustering steps:
- Initialization. Use k-means++ seeding to choose the initial cluster centers, which greatly improves the odds of converging to a good solution.
- Compute the membership u(i,j) of every point j with respect to every cluster center i, where m is the weighting exponent:
u(i,j) = 1 / sum_k (distance(point(j), center(i)) / distance(point(j), center(k)))^(1/(m-1))
- Compute the new cluster centers and record the trajectory of each center:
v(i) = sum_j (u(i,j)^m * point(j)) / sum_j (u(i,j)^m)
- Check whether the centers moved less than the given error tolerance. If not, return to step 2; otherwise exit the loop.
- Plot the center trajectories and the clustering result.
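The membership formula in step 2 can be checked on a toy case. The numbers below are made up, assuming m = 2 and two centers; the full implementation that follows computes the same quantity for every point:

```python
m = 2.0
squared_distances = [1.0, 4.0]   # distances from point j to centers v0, v1

def membership(i, dists, m):
    # u(i,j) = 1 / sum_k (d(j,i)/d(j,k))^(1/(m-1))
    return 1.0 / sum((dists[i] / dk) ** (1.0 / (m - 1.0)) for dk in dists)

u = [membership(i, squared_distances, m) for i in range(len(squared_distances))]
print(u)          # [0.8, 0.2]: the closer center gets the larger membership
print(sum(u))     # one point's memberships always sum to 1
```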
Python code
import math
import random
import copy
import pylab

FLOAT_MAX = 1e100

class Point:
    __slots__ = ["x", "y", "group", "membership"]
    def __init__(self, clusterCenterNumber, x=0, y=0, group=0):
        self.x, self.y, self.group = x, y, group
        self.membership = [0.0 for _ in range(clusterCenterNumber)]

def generatePoints(pointsNumber, radius, clusterCenterNumber):
    # The first half of the points lie inside a disc; the second half are
    # uniform noise over the enclosing square.
    points = [Point(clusterCenterNumber) for _ in range(2 * pointsNumber)]
    for point in points[:pointsNumber]:
        r = random.random() * radius
        angle = random.random() * 2 * math.pi
        point.x = r * math.cos(angle)
        point.y = r * math.sin(angle)
    for point in points[pointsNumber:]:
        point.x = 2 * radius * random.random() - radius
        point.y = 2 * radius * random.random() - radius
    return points

def solveDistanceBetweenPoints(pointA, pointB):
    return (pointA.x - pointB.x) ** 2 + (pointA.y - pointB.y) ** 2

def getNearestCenter(point, clusterCenterGroup):
    minIndex = point.group
    minDistance = FLOAT_MAX
    for index, center in enumerate(clusterCenterGroup):
        distance = solveDistanceBetweenPoints(point, center)
        if distance < minDistance:
            minDistance = distance
            minIndex = index
    return minIndex, minDistance

def kMeansPlusPlus(points, clusterCenterGroup):
    # Same k-means++ seeding as in the HCM version above.
    clusterCenterGroup[0] = copy.copy(random.choice(points))
    distanceGroup = [0.0] * len(points)
    for index in range(1, len(clusterCenterGroup)):
        total = 0.0
        for i, point in enumerate(points):
            distanceGroup[i] = getNearestCenter(point, clusterCenterGroup[:index])[1]
            total += distanceGroup[i]
        total *= random.random()
        for i, distance in enumerate(distanceGroup):
            total -= distance
            if total < 0:
                clusterCenterGroup[index] = copy.copy(points[i])
                break

def fuzzyCMeansClustering(points, clusterCenterNumber, weight):
    clusterCenterGroup = [Point(clusterCenterNumber) for _ in range(clusterCenterNumber)]
    kMeansPlusPlus(points, clusterCenterGroup)
    clusterCenterTrace = [[clusterCenter] for clusterCenter in clusterCenterGroup]
    tolerableError, currentError = 1.0, FLOAT_MAX
    while currentError >= tolerableError:
        for point in points:
            getSingleMembership(point, clusterCenterGroup, weight)
        currentCenterGroup = [Point(clusterCenterNumber) for _ in range(clusterCenterNumber)]
        for centerIndex, center in enumerate(currentCenterGroup):
            # New center: membership-weighted average of all points.
            upperSumX, upperSumY, lowerSum = 0.0, 0.0, 0.0
            for point in points:
                membershipWeight = pow(point.membership[centerIndex], weight)
                upperSumX += point.x * membershipWeight
                upperSumY += point.y * membershipWeight
                lowerSum += membershipWeight
            center.x = upperSumX / lowerSum
            center.y = upperSumY / lowerSum
        # Update the center traces and measure how far the centers moved.
        currentError = 0.0
        for index, singleTrace in enumerate(clusterCenterTrace):
            singleTrace.append(currentCenterGroup[index])
            currentError += solveDistanceBetweenPoints(singleTrace[-1], singleTrace[-2])
            clusterCenterGroup[index] = copy.copy(currentCenterGroup[index])
    # Finally, assign every point to the center with its largest membership.
    for point in points:
        maxIndex, maxMembership = 0, 0.0
        for index, singleMembership in enumerate(point.membership):
            if singleMembership > maxMembership:
                maxMembership = singleMembership
                maxIndex = index
        point.group = maxIndex
    return clusterCenterGroup, clusterCenterTrace

def getSingleMembership(point, clusterCenterGroup, weight):
    distanceFromPoint2ClusterCenterGroup = [solveDistanceBetweenPoints(point, center) for center in clusterCenterGroup]
    for centerIndex in range(len(clusterCenterGroup)):
        total = 0.0
        coincideIndex = -1
        for index, distance in enumerate(distanceFromPoint2ClusterCenterGroup):
            if distance == 0:
                # The point coincides with a center: full membership to it.
                coincideIndex = index
                break
            total += pow(distanceFromPoint2ClusterCenterGroup[centerIndex] / distance, 1.0 / (weight - 1.0))
        if coincideIndex >= 0:
            point.membership[centerIndex] = 1.0 if coincideIndex == centerIndex else 0.0
        else:
            point.membership[centerIndex] = 1.0 / total

def showClusterAnalysisResults(points, clusterCenterTrace):
    colorStore = ['or', 'og', 'ob', 'oc', 'om', 'oy', 'ok']
    pylab.figure(figsize=(9, 9), dpi=80)
    for point in points:
        color = colorStore[point.group] if point.group < len(colorStore) else colorStore[-1]
        pylab.plot(point.x, point.y, color)
    for singleTrace in clusterCenterTrace:
        pylab.plot([center.x for center in singleTrace], [center.y for center in singleTrace], 'k')
    pylab.show()

def main():
    clusterCenterNumber = 5
    pointsNumber = 2000
    radius = 10
    weight = 2
    points = generatePoints(pointsNumber, radius, clusterCenterNumber)
    _, clusterCenterTrace = fuzzyCMeansClustering(points, clusterCenterNumber, weight)
    showClusterAnalysisResults(points, clusterCenterTrace)

main()
The algorithm has the following characteristics:
- Its main properties are similar to those of ordinary k-means.
- It requires complete clustering and cannot separate out noise points.
- The resulting centers fit the data better, but the computation is comparatively slower.
- By introducing the weighting exponent and membership degrees, points no longer belong directly to a single cluster center.
(2) Agglomerative hierarchical clustering
Initially every element is its own cluster; at each step, the two clusters with the smallest inter-cluster distance are merged, until the number of clusters meets the requirement or more than a given fraction (roughly 90%) of the clusters has been merged away. The procedure resembles building a Huffman tree, or union-find. The inter-cluster distance can be defined in several ways: the minimum distance between elements of the two clusters, the maximum such distance, or the distance between the cluster centroids.
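The three inter-cluster distance definitions can be sketched as follows (a minimal illustration with made-up clusters; the implementation below effectively uses the minimum-distance, i.e. single-link, variant):

```python
import math

def euclidean(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def single_link(c1, c2):
    # minimum pairwise distance between the two clusters
    return min(euclidean(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):
    # maximum pairwise distance between the two clusters
    return max(euclidean(p, q) for p in c1 for q in c2)

def centroid_link(c1, c2):
    # distance between the two cluster centroids
    def cen(c):
        return (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
    return euclidean(cen(c1), cen(c2))

a = [(0, 0), (0, 2)]
b = [(3, 0), (5, 0)]
print(single_link(a, b), complete_link(a, b), centroid_link(a, b))
```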
The algorithm has the following characteristics:
- Agglomerative clustering uses more memory than the other methods discussed here.
- It can exclude the interference of noise points, although the noise may end up grouped into a cluster of its own.
- It suits irregular cluster shapes and cases where complete clustering is not required.
- Merge operations cannot be undone.
- Note that merging must be bounded by a merge-ratio limit; otherwise over-merging may pull all the cluster centers together and the clustering fails.
Python code
import math
import random
import pylab

class Point:
    __slots__ = ["x", "y", "group"]
    def __init__(self, x=0, y=0, group=0):
        self.x, self.y, self.group = x, y, group

def generatePoints(pointsNumber, radius):
    # Four discs of points centered at the corners (+-radius, +-radius);
    # every point starts out as its own cluster (group = its own index).
    points = [Point() for _ in range(4 * pointsNumber)]
    originX = [-radius, -radius, radius, radius]
    originY = [-radius, radius, -radius, radius]
    count = 0
    countCenter = 0
    for index, point in enumerate(points):
        count += 1
        r = random.random() * radius
        angle = random.random() * 2 * math.pi
        point.x = r * math.cos(angle) + originX[countCenter]
        point.y = r * math.sin(angle) + originY[countCenter]
        point.group = index
        if count >= pointsNumber * (countCenter + 1):
            countCenter += 1
    return points

def solveDistanceBetweenPoints(pointA, pointB):
    return (pointA.x - pointB.x) ** 2 + (pointA.y - pointB.y) ** 2

def getDistanceMap(points):
    # All pairwise squared distances, sorted ascending; keys encode the pair.
    distanceMap = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            distanceMap[str(i) + '#' + str(j)] = solveDistanceBetweenPoints(points[i], points[j])
    return sorted(distanceMap.items(), key=lambda dist: dist[1])

def agglomerativeHierarchicalClustering(points, distanceMap, mergeRatio, clusterCenterNumber):
    # Repeatedly merge the closest pair of clusters until only
    # len(points) * mergeRatio clusters remain.
    unsortedGroup = {index: 1 for index in range(len(points))}
    for key, _ in distanceMap:
        lowIndex, highIndex = int(key.split('#')[0]), int(key.split('#')[1])
        if points[lowIndex].group != points[highIndex].group:
            lowGroupIndex = points[lowIndex].group
            highGroupIndex = points[highIndex].group
            unsortedGroup[lowGroupIndex] += unsortedGroup[highGroupIndex]
            del unsortedGroup[highGroupIndex]
            for point in points:
                if point.group == highGroupIndex:
                    point.group = lowGroupIndex
        if len(unsortedGroup) <= int(len(points) * mergeRatio):
            break
    # Mark the largest clusters with negative group ids; the rest count as noise.
    sortedGroup = sorted(unsortedGroup.items(), key=lambda group: group[1], reverse=True)
    topClusterCenterCount = 0
    for key, _ in sortedGroup:
        topClusterCenterCount += 1
        for point in points:
            if point.group == key:
                point.group = -1 * topClusterCenterCount
        if topClusterCenterCount >= clusterCenterNumber:
            break
    return points

def showClusterAnalysisResults(points):
    colorStore = ['or', 'og', 'ob', 'oc', 'om', 'oy', 'ok']
    pylab.figure(figsize=(9, 9), dpi=80)
    for point in points:
        if point.group < 0:
            color = colorStore[-1 * point.group - 1]
        else:
            color = colorStore[-1]
        pylab.plot(point.x, point.y, color)
    pylab.show()

def main():
    clusterCenterNumber = 4
    pointsNumber = 500
    radius = 10
    mergeRatio = 0.025
    points = generatePoints(pointsNumber, radius)
    distanceMap = getDistanceMap(points)
    points = agglomerativeHierarchicalClustering(points, distanceMap, mergeRatio, clusterCenterNumber)
    showClusterAnalysisResults(points)

main()
(3) DBSCAN
DBSCAN is a density-based clustering algorithm, so density must be defined first. Here, the density of a point is the number of points inside a square region of side 2*Eps centered on it (an approximation of the usual Eps-neighborhood). Points of different densities are then classified into different types:
- A point whose density is at least the threshold MinPts is a core point.
- A point whose density is below MinPts but that has at least one core point in its neighborhood is a boundary point.
- A point that is neither a core point nor a boundary point is a noise point.
The procedure:
- Put all neighboring core points into the same cluster.
- Assign every boundary point to the cluster of a core point within its neighborhood.
- Leave noise points unassigned.
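The classification step can be sketched in a few lines. The values of eps, min_pts, and the sample points below are made up for illustration; the square neighborhood mirrors the implementation that follows:

```python
eps, min_pts = 1.5, 3
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2.5, 0), (8, 8)]

def neighbors(i):
    # Indices of all other points inside the square of half-width eps around point i.
    return [j for j, q in enumerate(points) if j != i
            and abs(points[i][0] - q[0]) <= eps and abs(points[i][1] - q[1]) <= eps]

core = {i for i in range(len(points)) if len(neighbors(i)) >= min_pts}
border = {i for i in range(len(points)) if i not in core
          and any(j in core for j in neighbors(i))}
noise = set(range(len(points))) - core - border
print(sorted(core), sorted(border), sorted(noise))  # [0, 1, 2, 3] [4] [5]
```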
The algorithm has the following characteristics:
- It can exclude the interference of noise points.
- It suits irregular cluster shapes and cases where complete clustering is not required.
- Merge operations cannot be undone.
- minPointsNumberWithinBoundary and Eps determine the granularity and scope of the clustering: increasing Eps or decreasing minPointsNumberWithinBoundary makes the clustering coarser, forming larger clusters. For a specific problem, Eps and minPointsNumberWithinBoundary must be tuned to meet the clustering requirements.
- By working with densities, density-based clustering avoids some of the distance computations, which can improve efficiency.
Python code
import math
import random
import pylab

CORE_POINT_TYPE = -2
OTHER_POINT_TYPE = -1  # noise, until proven otherwise
# Any non-negative pointType marks a boundary point and stores the index of
# a core point found within its neighborhood.

class Point:
    __slots__ = ["x", "y", "group", "pointType"]
    def __init__(self, x=0, y=0, group=0, pointType=OTHER_POINT_TYPE):
        self.x, self.y, self.group, self.pointType = x, y, group, pointType

def generatePoints(pointsNumber, radius):
    # Four discs of points centered at the corners (+-radius, +-radius);
    # every point starts out as its own cluster (group = its own index).
    points = [Point() for _ in range(4 * pointsNumber)]
    originX = [-radius, -radius, radius, radius]
    originY = [-radius, radius, -radius, radius]
    count = 0
    countCenter = 0
    for index, point in enumerate(points):
        count += 1
        r = random.random() * radius
        angle = random.random() * 2 * math.pi
        point.x = r * math.cos(angle) + originX[countCenter]
        point.y = r * math.sin(angle) + originY[countCenter]
        point.group = index
        if count >= pointsNumber * (countCenter + 1):
            countCenter += 1
    return points

def isInPointBoundary(centerPoint, customPoint, halfScale):
    # Square neighborhood of half-width halfScale: an approximation of the
    # usual Eps ball that avoids distance computations.
    return (abs(customPoint.x - centerPoint.x) <= halfScale and
            abs(customPoint.y - centerPoint.y) <= halfScale)

def getPointsNumberWithinBoundary(points, halfScale):
    # For every point, collect the indices of all points in its neighborhood.
    pointsIndexGroupWithinBoundary = [[] for _ in range(len(points))]
    for centerIndex, centerPoint in enumerate(points):
        for index, customPoint in enumerate(points):
            if centerIndex != index and isInPointBoundary(centerPoint, customPoint, halfScale):
                pointsIndexGroupWithinBoundary[centerIndex].append(index)
    return pointsIndexGroupWithinBoundary

def decidePointsType(points, pointsIndexGroupWithinBoundary, minPointsNumber):
    # Core points have at least minPointsNumber neighbors; boundary points
    # have a core point among their neighbors; everything else is noise.
    for index, customPointsGroup in enumerate(pointsIndexGroupWithinBoundary):
        if len(customPointsGroup) >= minPointsNumber:
            points[index].pointType = CORE_POINT_TYPE
    for index, customPointsGroup in enumerate(pointsIndexGroupWithinBoundary):
        if len(customPointsGroup) < minPointsNumber:
            for customPointIndex in customPointsGroup:
                if points[customPointIndex].pointType == CORE_POINT_TYPE:
                    points[index].pointType = customPointIndex

def mergeGroup(points, fromIndex, toIndex):
    for point in points:
        if point.group == fromIndex:
            point.group = toIndex

def dbscan(points, pointsIndexGroupWithinBoundary, clusterCenterNumber):
    countGroupsNumber = {index: 1 for index in range(len(points))}
    for index, point in enumerate(points):
        if point.pointType == CORE_POINT_TYPE:
            # Merge neighboring core points into one cluster.
            for customPointIndex in pointsIndexGroupWithinBoundary[index]:
                if points[customPointIndex].pointType == CORE_POINT_TYPE and points[customPointIndex].group != point.group:
                    countGroupsNumber[point.group] += countGroupsNumber[points[customPointIndex].group]
                    del countGroupsNumber[points[customPointIndex].group]
                    mergeGroup(points, points[customPointIndex].group, point.group)
        elif point.pointType >= 0:
            # Boundary point: join the cluster of the core point it recorded.
            corePointGroupIndex = points[point.pointType].group
            countGroupsNumber[corePointGroupIndex] += countGroupsNumber[point.group]
            del countGroupsNumber[point.group]
            point.group = corePointGroupIndex
    # Mark the largest clusters with negative group ids; the rest is noise.
    sortedGroups = sorted(countGroupsNumber.items(), key=lambda group: group[1], reverse=True)
    count = 0
    for key, _ in sortedGroups:
        count += 1
        for point in points:
            if point.group == key:
                point.group = -1 * count
        if count >= clusterCenterNumber:
            break

def showClusterAnalysisResults(points):
    colorStore = ['or', 'og', 'ob', 'oc', 'om', 'oy', 'ok']
    pylab.figure(figsize=(9, 9), dpi=80)
    for point in points:
        if point.group < 0:
            color = colorStore[-1 * point.group - 1]
        else:
            color = colorStore[-1]
        pylab.plot(point.x, point.y, color)
    pylab.show()

def main():
    clusterCenterNumber = 4
    pointsNumber = 500
    radius = 10
    Eps = 2
    minPointsNumber = 18
    points = generatePoints(pointsNumber, radius)
    pointsIndexGroupWithinBoundary = getPointsNumberWithinBoundary(points, Eps)
    decidePointsType(points, pointsIndexGroupWithinBoundary, minPointsNumber)
    dbscan(points, pointsIndexGroupWithinBoundary, clusterCenterNumber)
    showClusterAnalysisResults(points)

main()
后記
在學習和分析過程中發(fā)現(xiàn)幾點待解決的問題:
- 其一,上述聚類過程都需要人為指定聚類中心數(shù)目,然而聚類的過程如果需人為干預,這可能是一個比較麻煩的問題。解決辦法可以是采用多個候選聚類中心數(shù)目
{i,i+1,...k}
,對于不同的聚類中心數(shù)目都會有對應的分析結(jié)果,再采用貝葉斯定理。另一方面,機器無法知道人所需要的聚類粒度和聚類數(shù)目,如果完全由機器確定,也是不合理的。 - 其二,k-means聚類必須是完全聚類,對距離的選擇也可以依據(jù)問題而定。
- 其三,實際上凝聚層次聚類和基于密度的dbscan聚類都有一個合并的過程,對于這種合并最好的算法應該是查并集,其時間復雜度為
O(n * f(n))
,對于目前常見的大整數(shù)n,f(n) < 4
。但如果過于追求效率,那么就違背了python語言開發(fā)和分析數(shù)據(jù)的優(yōu)勢。 - 其四,凝聚層次聚類和基于密度的dbscan聚類都對合并的程度有一定要求。凝聚層次聚類通過
mergeRatio
來確定合并的比例;而dbscan是通過Eps
和minPointsNumber
來確定聚類的粒度。
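The union-find structure suggested above can be sketched as follows (with path compression and union by size, giving near-constant amortized cost per operation; class and method names are illustrative):

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))  # each element starts as its own root
        self.size = [1] * n

    def find(self, i):
        # Follow parent links to the root, halving the path as we go.
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path compression
            i = self.parent[i]
        return i

    def union(self, a, b):
        # Merge the two sets, attaching the smaller tree under the larger.
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

uf = UnionFind(5)
uf.union(0, 1)
uf.union(1, 2)
print(uf.find(0) == uf.find(2))   # True: 0 and 2 are now in one cluster
print(uf.find(3) == uf.find(4))   # False: 3 and 4 remain separate
```

Replacing the linear `mergeGroup` scans in the agglomerative and DBSCAN code with this structure would reduce the cost of each merge from O(n) to effectively O(1).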