k-Means++ / FCM / Agglomerative Hierarchical Clustering / DBSCAN

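Based on "A First Look at Clustering Algorithms: K-Means, Agglomerative Hierarchical Clustering and DBSCAN" and "The Fuzzy Clustering FCM Algorithm".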

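The purpose of clustering

Clustering partitions data into clusters so that elements within a cluster are highly similar and elements in different clusters are not; the stronger this contrast, the better the clustering. It is used to extract information and patterns from data.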

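Clustering concepts

  • Hierarchical vs. partitional: when sub-clusters are allowed, the data is divided level by level and the final result is a tree. The hierarchy in the tree is the hierarchy of the clustering. Sub-clusters at the same level do not overlap, and every element belongs to a sub-cluster at some level.
  • Exclusive, overlapping, and fuzzy: in exclusive clustering each element belongs to exactly one cluster, and in overlapping clustering it may belong to several. The core of fuzzy clustering is that no element belongs wholly to any single cluster; instead it belongs to every cluster with some membership degree. For any element, the membership degrees normally sum to 1.
  • Complete vs. partial: complete clustering requires every data element to be assigned to a cluster, while partial clustering allows noise that belongs to no cluster.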

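Types of clusters

  • Well-separated: the distance between any two elements in different clusters is greater than the distance between any two elements in the same cluster. On a plot, such clusters are visibly separated.
  • Prototype-based: every element is closer to the center of its own cluster (the centroid of the cluster's elements) than to the center of any other cluster.
  • Graph-based: the nodes of the graph are the objects and the edge weights are the distances. The definition mirrors the well-separated or prototype-based definition, except that edge weights replace an arbitrarily chosen distance.
  • Density-based: a commonly used and widely applicable definition. The density of the elements determines where the clusters lie. It is usually chosen when noise is present and needs to be identified, or when clusters have irregular shapes.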

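Common clustering algorithms

  • Basic k-means: the k-means algorithm. Its clusters are prototype-based. It applies when the number of clusters is known, expects roughly circular clusters, and cannot distinguish noise.
  • Agglomerative hierarchical clustering: every point starts as its own cluster, and the closest clusters are merged repeatedly until the number of clusters meets the user's requirement.
  • DBSCAN: a density-based, partitional clustering method.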

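Basic ideas of the clustering algorithms

(1) Basic k-means clustering (hard c-means, HCM)

The method is simple. Start with a few initial cluster centers. Assign every element to the cluster whose center is nearest. Then compute a new center for each cluster (the centroid of its elements). Repeat these steps until the centroids no longer move noticeably; the clustering is then complete.

Which distance to use depends on the nature of the data or the requirements of the project. For the kinds of distance, see the Manhattan, diagonal, and Euclidean distances described in "An Overview of the A-star Algorithm and Its Application in Game Development". In effect, k-means minimizes a global objective function: the sum of the distances from each element to its nearest cluster center (with Euclidean distance, standard k-means uses the squared distance).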
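The distances and the objective function mentioned above can be sketched as follows. This is only an illustration: the function names are made up for this example, and "diagonal distance" is assumed here to mean the Chebyshev distance (the largest coordinate difference).

```python
import math

def manhattan(a, b):
    """Sum of the absolute coordinate differences."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def diagonal(a, b):
    """Diagonal distance, assumed here to be Chebyshev: the largest coordinate difference."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

def euclidean(a, b):
    """Straight-line distance."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def objective(points, centers, distance):
    """Sum over all points of the distance to the nearest center."""
    return sum(min(distance(p, c) for c in centers) for p in points)

points = [(0, 0), (3, 4), (10, 10)]
centers = [(0, 0), (10, 10)]
print(objective(points, centers, manhattan))  # 7
print(objective(points, centers, diagonal))   # 4
print(objective(points, centers, euclidean))  # 5.0
```

The same centers score differently under each distance, which is why the choice of distance should follow the data.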

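The algorithm has the following characteristics:

  • First, it is not guaranteed to find the global optimum: with poorly chosen initial cluster centers it may only reach a local optimum. Researchers have proposed a seeding step that spreads the initial centers out, which makes a poor local optimum less likely, although it still does not guarantee the global optimum; the resulting method is called k-means++.
  • Second, it cannot exclude the influence of noise points on the clustering.
  • Third, it expects clusters to be roughly circular.
  • Fourth, it applies to cases that require complete clustering.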
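The first point can be seen on a tiny data set. The sketch below is a plain k-means written only for this illustration (the names `k_means` and `squared_distance` are not from the author's code); it runs from two different sets of initial centers on the four corners of a 4-by-1 rectangle.

```python
def squared_distance(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def k_means(points, centers, max_iterations=100):
    """Plain k-means from the given initial centers; returns (centers, sse)."""
    centers = list(centers)
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the centroid of its group.
        new_centers = []
        for center, group in zip(centers, groups):
            if group:
                new_centers.append((sum(p[0] for p in group) / len(group),
                                    sum(p[1] for p in group) / len(group)))
            else:
                new_centers.append(center)  # an empty cluster keeps its center
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return centers, sse

points = [(0, 0), (0, 1), (4, 0), (4, 1)]
_, bad_sse = k_means(points, [(2, 0), (2, 1)])       # splits top row from bottom row
_, good_sse = k_means(points, [(0, 0.5), (4, 0.5)])  # splits left side from right side
print(bad_sse, good_sse)  # 16.0 1.0
```

Both runs stop at a fixed point, but the first one is stuck in a local optimum whose objective is 16 times worse.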
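k-Means++

Python code

This code uses the k-means++ algorithm, which chooses the initial cluster centers by a fixed seeding rule so that the iterations that follow are more likely to converge to a good solution.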

import math
import random
import copy
import pylab

FLOAT_MAX = 1e100

class Point:
    __slots__ = ["x", "y", "group"]
    def __init__(self, x = 0, y = 0, group = 0):
        self.x, self.y, self.group = x, y, group

def generatePoints(pointsNumber, radius):
    points = [Point() for _ in range(pointsNumber)]
    for point in points:
        r = random.random() * radius
        angle = random.random() * 2 * math.pi
        point.x = r * math.cos(angle)
        point.y = r * math.sin(angle)
    return points

def solveDistanceBetweenPoints(pointA, pointB):
    return (pointA.x - pointB.x) * (pointA.x - pointB.x) + (pointA.y - pointB.y) * (pointA.y - pointB.y)

def getNearestCenter(point, clusterCenterGroup):
    minIndex = point.group
    minDistance = FLOAT_MAX
    for index, center in enumerate(clusterCenterGroup):
        distance = solveDistanceBetweenPoints(point, center)
        if (distance < minDistance):
            minDistance = distance
            minIndex = index
    return (minIndex, minDistance)

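def kMeansPlusPlus(points, clusterCenterGroup):
    # Pick the first center uniformly at random from the data.
    clusterCenterGroup[0] = copy.copy(random.choice(points))
    distanceGroup = [0.0 for _ in range(len(points))]
    for index in range(1, len(clusterCenterGroup)):
        # Squared distance from every point to its nearest already-chosen center.
        # The running total is reset for every new center.
        total = 0.0
        for i, point in enumerate(points):
            distanceGroup[i] = getNearestCenter(point, clusterCenterGroup[:index])[1]
            total += distanceGroup[i]
        # Sample the next center with probability proportional to that squared distance.
        total *= random.random()
        for i, distance in enumerate(distanceGroup):
            total -= distance
            if total < 0:
                clusterCenterGroup[index] = copy.copy(points[i])
                break
    # Assign every point to its nearest initial center.
    for point in points:
        point.group = getNearestCenter(point, clusterCenterGroup)[0]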

def kMeans(points, clusterCenterNumber):
    clusterCenterGroup = [Point() for _ in range(clusterCenterNumber)]
    kMeansPlusPlus(points, clusterCenterGroup)
    clusterCenterTrace = [[clusterCenter] for clusterCenter in clusterCenterGroup]
    # currentError is the summed squared movement of all centers in one iteration.
    tolerableError, currentError = 5.0, FLOAT_MAX
    while currentError >= tolerableError:
        countCenterNumber = [0 for _ in range(clusterCenterNumber)]
        currentCenterGroup = [Point() for _ in range(clusterCenterNumber)]
        # Accumulate the coordinate sums of each cluster.
        for point in points:
            currentCenterGroup[point.group].x += point.x
            currentCenterGroup[point.group].y += point.y
            countCenterNumber[point.group] += 1
        # Turn the sums into centroids; an empty cluster keeps its previous center.
        for index, center in enumerate(currentCenterGroup):
            if countCenterNumber[index] == 0:
                center.x, center.y = clusterCenterGroup[index].x, clusterCenterGroup[index].y
            else:
                center.x /= countCenterNumber[index]
                center.y /= countCenterNumber[index]
        # Record the new centers and measure how far they moved.
        currentError = 0.0
        for index, singleTrace in enumerate(clusterCenterTrace):
            singleTrace.append(currentCenterGroup[index])
            currentError += solveDistanceBetweenPoints(singleTrace[-1], singleTrace[-2])
            clusterCenterGroup[index] = copy.copy(currentCenterGroup[index])
        # Reassign every point to its nearest center.
        for point in points:
            point.group = getNearestCenter(point, clusterCenterGroup)[0]
    return clusterCenterGroup, clusterCenterTrace

def showClusterAnalysisResults(points, clusterCenterTrace):
    colorStore = ['or', 'og', 'ob', 'oc', 'om', 'oy', 'ok']
    pylab.figure(figsize=(9, 9), dpi = 80)
    for point in points:
        color = ''
        if point.group >= len(colorStore):
            color = colorStore[-1]
        else:
            color = colorStore[point.group]
        pylab.plot(point.x, point.y, color)
    for singleTrace in clusterCenterTrace:
        pylab.plot([center.x for center in singleTrace], [center.y for center in singleTrace], 'k')
    pylab.show()

def main():
    clusterCenterNumber = 5
    pointsNumber = 2000
    radius = 10
    points = generatePoints(pointsNumber, radius)
    _, clusterCenterTrace = kMeans(points, clusterCenterNumber)
    showClusterAnalysisResults(points, clusterCenterTrace)

main()

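(1) Extra: fuzzy c-means clustering (FCM)

Fuzzy c-means clustering, like hard-partition k-means, is a partition-based clustering method; FCM is the natural extension of HCM. Unlike k-means, each point belongs to the different cluster centers v(i) with different membership degrees u(i,j); the clustering procedure itself is similar to k-means. (For details, see "The Fuzzy Clustering FCM Algorithm".)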

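Clustering steps:

  • Step 1: initialize. Choose the initial cluster centers with k-means++ to make a good solution more likely.
  • Step 2: compute the membership degree u(i,j) of each point j in each cluster center i, where m is the weighting exponent and k runs over all cluster centers. In the code that follows, distance is the squared Euclidean distance:
  • u(i,j) = 1 / sum_k( (distance(point(j), center(i)) / distance(point(j), center(k)))^(1/(m-1)) )
  • Step 3: compute the new cluster centers and record how they move, summing over all points j:
  • v(i) = sum_j(u(i,j)^m * point(j)) / sum_j(u(i,j)^m)
  • Step 4: check whether the movement of the cluster centers is below the given error tolerance. If not, go back to step 2; otherwise leave the loop.
  • Step 5: plot the cluster center traces and the clustering result.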
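The membership and center formulas can be checked by hand on a small case. The sketch below is an illustration only: `memberships` and `update_center` are hypothetical names, and the distances are assumed to be squared Euclidean distances, as in the author's code.

```python
def memberships(squared_distances, m=2.0):
    """Membership of one point in each cluster, from its squared distances to the centers."""
    # A point sitting exactly on a center belongs fully to that center.
    if 0.0 in squared_distances:
        hit = squared_distances.index(0.0)
        return [1.0 if i == hit else 0.0 for i in range(len(squared_distances))]
    exponent = 1.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in squared_distances)
            for d_i in squared_distances]

def update_center(points, u, m=2.0):
    """New center of one cluster: mean of the points weighted by u^m."""
    weights = [u_j ** m for u_j in u]
    total = sum(weights)
    x = sum(w * p[0] for w, p in zip(weights, points)) / total
    y = sum(w * p[1] for w, p in zip(weights, points)) / total
    return (x, y)

# One point at squared distance 1 from the first center and 4 from the second.
u = memberships([1.0, 4.0])
print(u)  # [0.8, 0.2]

# Two points with memberships 0.8 and 0.2 in the same cluster pull its center
# toward the point with the larger membership.
center = update_center([(0.0, 0.0), (2.0, 0.0)], [0.8, 0.2])
print(center)
```

With m = 2 the memberships are inversely proportional to the squared distances, and they sum to 1.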
FCM

Python code

import math
import random
import copy
import pylab

FLOAT_MAX = 1e100

class Point:
    __slots__ = ["x", "y", "group", "membership"]
    def __init__(self, clusterCenterNumber, x = 0, y = 0, group = 0):
        self.x, self.y, self.group = x, y, group
        self.membership = [0.0 for _ in range(clusterCenterNumber)]

def generatePoints(pointsNumber, radius, clusterCenterNumber):
    points = [Point(clusterCenterNumber) for _ in range(2 * pointsNumber)]
    # First half: points scattered inside a disc of the given radius.
    for point in points[:pointsNumber]:
        r = random.random() * radius
        angle = random.random() * 2 * math.pi
        point.x = r * math.cos(angle)
        point.y = r * math.sin(angle)
    # Second half: points spread uniformly over the enclosing square.
    for index in range(pointsNumber, 2 * pointsNumber):
        points[index].x = 2 * radius * random.random() - radius
        points[index].y = 2 * radius * random.random() - radius
    return points
    

def solveDistanceBetweenPoints(pointA, pointB):
    return (pointA.x - pointB.x) * (pointA.x - pointB.x) + (pointA.y - pointB.y) * (pointA.y - pointB.y)

def getNearestCenter(point, clusterCenterGroup):
    minIndex = point.group
    minDistance = FLOAT_MAX
    for index, center in enumerate(clusterCenterGroup):
        distance = solveDistanceBetweenPoints(point, center)
        if (distance < minDistance):
            minDistance = distance
            minIndex = index
    return (minIndex, minDistance)

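def kMeansPlusPlus(points, clusterCenterGroup):
    # Pick the first center uniformly at random from the data.
    clusterCenterGroup[0] = copy.copy(random.choice(points))
    distanceGroup = [0.0 for _ in range(len(points))]
    for index in range(1, len(clusterCenterGroup)):
        # Squared distance from every point to its nearest already-chosen center.
        # The running total is reset for every new center.
        total = 0.0
        for i, point in enumerate(points):
            distanceGroup[i] = getNearestCenter(point, clusterCenterGroup[:index])[1]
            total += distanceGroup[i]
        # Sample the next center with probability proportional to that squared distance.
        total *= random.random()
        for i, distance in enumerate(distanceGroup):
            total -= distance
            if total < 0:
                clusterCenterGroup[index] = copy.copy(points[i])
                break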

def fuzzyCMeansClustering(points, clusterCenterNumber, weight):
    clusterCenterGroup = [Point(clusterCenterNumber) for _ in range(clusterCenterNumber)]
    kMeansPlusPlus(points, clusterCenterGroup)
    clusterCenterTrace = [[clusterCenter] for clusterCenter in clusterCenterGroup]
    # currentError is the summed squared movement of all centers in one iteration.
    tolerableError, currentError = 1.0, FLOAT_MAX
    while currentError >= tolerableError:
        # Refresh every point's membership in every cluster.
        for point in points:
            getSingleMembership(point, clusterCenterGroup, weight)
        # Recompute each center as the membership-weighted mean of the points.
        currentCenterGroup = [Point(clusterCenterNumber) for _ in range(clusterCenterNumber)]
        for centerIndex, center in enumerate(currentCenterGroup):
            upperSumX, upperSumY, lowerSum = 0.0, 0.0, 0.0
            for point in points:
                membershipWeight = pow(point.membership[centerIndex], weight)
                upperSumX += point.x * membershipWeight
                upperSumY += point.y * membershipWeight
                lowerSum += membershipWeight
            center.x = upperSumX / lowerSum
            center.y = upperSumY / lowerSum
        # update cluster center trace
        currentError = 0.0
        for index, singleTrace in enumerate(clusterCenterTrace):
            singleTrace.append(currentCenterGroup[index])
            currentError += solveDistanceBetweenPoints(singleTrace[-1], singleTrace[-2])
            clusterCenterGroup[index] = copy.copy(currentCenterGroup[index])
    for point in points:
        maxIndex, maxMembership = 0, 0.0
        for index, singleMembership in enumerate(point.membership):
            if singleMembership > maxMembership:
                maxMembership = singleMembership
                maxIndex = index
        point.group = maxIndex
    return clusterCenterGroup, clusterCenterTrace

def getSingleMembership(point, clusterCenterGroup, weight):
    # Squared distances from this point to every cluster center.
    distances = [solveDistanceBetweenPoints(point, center) for center in clusterCenterGroup]
    if 0 in distances:
        # The point coincides with a center: full membership there, none elsewhere.
        coincideIndex = distances.index(0)
        for centerIndex in range(len(point.membership)):
            point.membership[centerIndex] = 1.0 if centerIndex == coincideIndex else 0.0
        return
    exponent = 1.0 / (weight - 1.0)
    for centerIndex in range(len(point.membership)):
        # u(i,j) = 1 / sum_k (d(i) / d(k)) ^ (1 / (m - 1))
        total = 0.0
        for distance in distances:
            total += pow(distances[centerIndex] / distance, exponent)
        point.membership[centerIndex] = 1.0 / total

def showClusterAnalysisResults(points, clusterCenterTrace):
    colorStore = ['or', 'og', 'ob', 'oc', 'om', 'oy', 'ok']
    pylab.figure(figsize=(9, 9), dpi = 80)
    for point in points:
        color = ''
        if point.group >= len(colorStore):
            color = colorStore[-1]
        else:
            color = colorStore[point.group]
        pylab.plot(point.x, point.y, color)
    for singleTrace in clusterCenterTrace:
        pylab.plot([center.x for center in singleTrace], [center.y for center in singleTrace], 'k')
    pylab.show()

def main():
    clusterCenterNumber = 5
    pointsNumber = 2000
    radius = 10
    weight = 2
    points = generatePoints(pointsNumber, radius, clusterCenterNumber)
    _, clusterCenterTrace = fuzzyCMeansClustering(points, clusterCenterNumber, weight)
    showClusterAnalysisResults(points, clusterCenterTrace)

main()

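The algorithm has the following characteristics:

  • Its main characteristics are similar to those of ordinary k-means.
  • It requires complete clustering and cannot distinguish noise points.
  • The cluster centers fit the data better, but the computation is less efficient.
  • It uses membership degrees as a smoothing parameter, so that a point does not belong outright to a single cluster center.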

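(2) Agglomerative hierarchical clustering

Initially every element is its own cluster; each step merges the two clusters with the smallest inter-cluster distance, until the number of clusters meets the requirement or more than 90% of the clusters have been merged. The procedure resembles Huffman tree construction and union-find. The inter-cluster distance can be defined in several ways: the minimum distance between elements of the two clusters, the maximum distance between elements of the two clusters, or the distance between the cluster centroids.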
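The three inter-cluster distances listed above can be sketched as follows. The function names are chosen for this illustration (they are the usual textbook names for these linkages, not identifiers from the author's code), and clusters are assumed to be plain lists of (x, y) tuples.

```python
import math

def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def single_linkage(a, b):
    """Minimum distance between an element of a and an element of b."""
    return min(euclidean(p, q) for p in a for q in b)

def complete_linkage(a, b):
    """Maximum distance between an element of a and an element of b."""
    return max(euclidean(p, q) for p in a for q in b)

def centroid(cluster):
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def centroid_linkage(a, b):
    """Distance between the centroids of the two clusters."""
    return euclidean(centroid(a), centroid(b))

left = [(0, 0), (0, 2)]
right = [(3, 0), (5, 2)]
print(single_linkage(left, right))    # 3.0
print(complete_linkage(left, right))  # sqrt(29), about 5.39
print(centroid_linkage(left, right))  # 4.0
```

The author's listing merges along the sorted list of point-pair distances, which corresponds to the minimum-distance definition.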

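The algorithm has the following characteristics:

  • Agglomerative clustering uses more memory than the other methods here.
  • It can exclude the influence of noise points, although noise points may end up forming a cluster of their own.
  • It suits irregularly shaped clusters and cases where complete clustering is not required.
  • A merge cannot be undone.
  • Note that merging must be capped by a merge ratio; otherwise over-merging can collapse all the clusters together and the clustering fails.
Agglomerative hierarchical clustering

Python code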

import math
import random
import pylab

FLOAT_MAX = 1e100

class Point:
    __slots__ = ["x", "y", "group"]
    def __init__(self, x = 0, y = 0, group = 0):
        self.x, self.y, self.group = x, y, group

def generatePoints(pointsNumber, radius):
    points = [Point() for _ in range(4 * pointsNumber)]
    originX = [-radius, -radius, radius, radius]
    originY = [-radius, radius, -radius, radius]
    count = 0
    countCenter = 0
    for index, point in enumerate(points):
        count += 1
        r = random.random() * radius
        angle = random.random() * 2 * math.pi
        point.x = r * math.cos(angle) + originX[countCenter]
        point.y = r * math.sin(angle) + originY[countCenter]
        point.group = index
        if count >= pointsNumber * (countCenter + 1):
            countCenter += 1    
    return points

def solveDistanceBetweenPoints(pointA, pointB):
    return (pointA.x - pointB.x) * (pointA.x - pointB.x) + (pointA.y - pointB.y) * (pointA.y - pointB.y)

def getDistanceMap(points):
    # Squared distance for every unordered pair of points, keyed as "i#j" with i < j.
    distanceMap = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            distanceMap[str(i) + '#' + str(j)] = solveDistanceBetweenPoints(points[i], points[j])
    # Return (key, distance) pairs sorted from closest to farthest.
    distanceMap = sorted(distanceMap.items(), key=lambda dist: dist[1], reverse=False)
    return distanceMap

def agglomerativeHierarchicalClustering(points, distanceMap, mergeRatio, clusterCenterNumber):
    # Size of each live cluster, keyed by group id; every point starts as its own cluster.
    unsortedGroup = {index: 1 for index in range(len(points))}
    # Walk the point pairs from closest to farthest and merge their clusters.
    for key, _ in distanceMap:
        lowIndex, highIndex = [int(part) for part in key.split('#')]
        if points[lowIndex].group != points[highIndex].group:
            lowGroupIndex = points[lowIndex].group
            highGroupIndex = points[highIndex].group
            unsortedGroup[lowGroupIndex] += unsortedGroup[highGroupIndex]
            del unsortedGroup[highGroupIndex]
            for point in points:
                if point.group == highGroupIndex:
                    point.group = lowGroupIndex
        # Stop once the cluster count has shrunk to mergeRatio of the point count.
        if len(unsortedGroup) <= int(len(points) * mergeRatio):
            break
    # Keep the clusterCenterNumber largest clusters and label them -1, -2, ...
    sortedGroup = sorted(unsortedGroup.items(), key=lambda group: group[1], reverse=True)
    topClusterCenterCount = 0
    print(sortedGroup, len(sortedGroup))
    for key, _ in sortedGroup:
        topClusterCenterCount += 1
        for point in points:
            if point.group == key:
                point.group = -1 * topClusterCenterCount
        if topClusterCenterCount >= clusterCenterNumber:
            break
    return points


def showClusterAnalysisResults(points):
    colorStore = ['or', 'og', 'ob', 'oc', 'om', 'oy', 'ok']
    pylab.figure(figsize=(9, 9), dpi = 80)
    for point in points:
        color = ''
        if point.group < 0:
            color = colorStore[-1 * point.group - 1]
        else:
            color = colorStore[-1]
        pylab.plot(point.x, point.y, color)
    pylab.show()

def main():
    clusterCenterNumber = 4
    pointsNumber = 500
    radius = 10
    mergeRatio = 0.025
    points = generatePoints(pointsNumber, radius)
    distanceMap = getDistanceMap(points)
    points = agglomerativeHierarchicalClustering(points, distanceMap, mergeRatio, clusterCenterNumber)
    showClusterAnalysisResults(points)

main()

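(3) DBSCAN

DBSCAN is a density-based clustering algorithm, so density must be defined first. Here the density of a point is the number of points inside the square of side 2*Eps centered on it (standard DBSCAN uses a circular neighborhood of radius Eps). Points are then classified by density:

  • A point whose density is at least the threshold MinPts is a core point.
  • A point whose density is below MinPts but whose neighborhood contains at least one core point is a boundary point.
  • A point that is neither a core point nor a boundary point is a noise point.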
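The three point types can be computed directly from these definitions. The following is a small sketch under the same assumptions as the article's code (square neighborhood, the point itself not counted, "core" meaning at least MinPts neighbors); `neighbours` and `classify` are illustrative names.

```python
def neighbours(points, index, eps):
    """Indices of the other points inside the 2*eps square centered on points[index]."""
    cx, cy = points[index]
    return [i for i, (x, y) in enumerate(points)
            if i != index and abs(x - cx) <= eps and abs(y - cy) <= eps]

def classify(points, eps, min_pts):
    """Label every point as 'core', 'boundary', or 'noise'."""
    neighbour_lists = [neighbours(points, i, eps) for i in range(len(points))]
    core = {i for i, ns in enumerate(neighbour_lists) if len(ns) >= min_pts}
    labels = []
    for i, ns in enumerate(neighbour_lists):
        if i in core:
            labels.append("core")
        elif any(n in core for n in ns):
            labels.append("boundary")  # sparse itself, but next to a core point
        else:
            labels.append("noise")
    return labels

points = [(0, 0), (1, 0), (0, 1), (1, 1), (2.5, 1), (9, 9)]
labels = classify(points, eps=1.5, min_pts=3)
print(labels)  # ['core', 'core', 'core', 'core', 'boundary', 'noise']
```

The point (2.5, 1) has only two neighbors, but both are core points, so it is a boundary point; (9, 9) has no neighbors at all and is noise.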

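Procedure:

  • Put all core points that are neighbors of one another into the same cluster.
  • Put each boundary point into the cluster of a core point in its neighborhood.
  • Leave noise points unassigned.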
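These three steps can be sketched as a flood fill over core points. This is a compact illustration rather than the author's implementation: `dbscan_labels` is a hypothetical name, it uses a breadth-first search instead of the group-merging loop of the article's code, and a boundary point simply joins the first cluster that reaches it.

```python
from collections import deque

def dbscan_labels(points, eps, min_pts):
    """Return one cluster id per point; None marks a noise point."""
    def near(i, j):
        # Square neighborhood of side 2*eps, as in the article.
        return (abs(points[i][0] - points[j][0]) <= eps and
                abs(points[i][1] - points[j][1]) <= eps)

    n = len(points)
    neighbours = [[j for j in range(n) if j != i and near(i, j)] for i in range(n)]
    core = [len(ns) >= min_pts for ns in neighbours]
    labels = [None] * n
    cluster = 0
    for start in range(n):
        if not core[start] or labels[start] is not None:
            continue
        # Flood-fill from an unvisited core point.
        labels[start] = cluster
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in neighbours[i]:
                if labels[j] is None:
                    labels[j] = cluster   # core or boundary point joins the cluster
                    if core[j]:
                        queue.append(j)   # only core points keep expanding
        cluster += 1
    return labels

points = [(0, 0), (1, 0), (0, 1), (1, 1), (2.5, 1), (9, 9),
          (20, 20), (21, 20), (20, 21), (21, 21)]
result = dbscan_labels(points, eps=1.5, min_pts=3)
print(result)  # [0, 0, 0, 0, 0, None, 1, 1, 1, 1]
```

The isolated point (9, 9) is never reached from a core point, so it stays unassigned.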

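The algorithm has the following characteristics:

  • It can exclude the influence of noise points.
  • It suits irregularly shaped clusters and cases where complete clustering is not required.
  • A merge cannot be undone.
  • Eps and minPointsNumberWithinBoundary determine the granularity and extent of the clustering: increasing Eps or decreasing minPointsNumberWithinBoundary both make the clustering coarser and produce larger clusters. For a particular problem, Eps and minPointsNumberWithinBoundary have to be tuned to meet the clustering requirements.
  • Density-based clustering avoids distance computations to some extent (the square neighborhood test used here only compares coordinates), which can improve efficiency.
DBSCAN

Python code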

import math
import random
import pylab

FLOAT_MAX = 1e100

CORE_POINT_TYPE = -2
# Any non-negative pointType marks a boundary point; the value is the index of a neighboring core point.
BOUNDARY_POINT_TYPE = 1
OTHER_POINT_TYPE = -1

class Point:
    __slots__ = ["x", "y", "group", "pointType"]
    def __init__(self, x = 0, y = 0, group = 0, pointType = -1):
        self.x, self.y, self.group, self.pointType = x, y, group, pointType

def generatePoints(pointsNumber, radius):
    points = [Point() for _ in range(4 * pointsNumber)]
    originX = [-radius, -radius, radius, radius]
    originY = [-radius, radius, -radius, radius]
    count = 0
    countCenter = 0
    for index, point in enumerate(points):
        count += 1
        r = random.random() * radius
        angle = random.random() * 2 * math.pi
        point.x = r * math.cos(angle) + originX[countCenter]
        point.y = r * math.sin(angle) + originY[countCenter]
        point.group = index
        if count >= pointsNumber * (countCenter + 1):
            countCenter += 1    
    return points

def solveDistanceBetweenPoints(pointA, pointB):
    return (pointA.x - pointB.x) * (pointA.x - pointB.x) + (pointA.y - pointB.y) * (pointA.y - pointB.y)

def isInPointBoundary(centerPoint, customPoint, halfScale):
    return customPoint.x <= centerPoint.x + halfScale and customPoint.x >= centerPoint.x - halfScale and customPoint.y <= centerPoint.y + halfScale and customPoint.y >= centerPoint.y - halfScale

def getPointsNumberWithinBoundary(points, halfScale):
    pointsIndexGroupWithinBoundary = [[] for _ in range(len(points))]
    for centerIndex, centerPoint in enumerate(points):
        for index, customPoint in enumerate(points):
            if centerIndex != index and isInPointBoundary(centerPoint, customPoint, halfScale):
                pointsIndexGroupWithinBoundary[centerIndex].append(index)
    return pointsIndexGroupWithinBoundary

def decidePointsType(points, pointsIndexGroupWithinBoundary, minPointsNumber):
    for index, customPointsGroup in enumerate(pointsIndexGroupWithinBoundary):
        if len(customPointsGroup) >= minPointsNumber:
            points[index].pointType = CORE_POINT_TYPE
    for index, customPointsGroup in enumerate(pointsIndexGroupWithinBoundary):
        if len(customPointsGroup) < minPointsNumber:
            for customPointIndex in customPointsGroup:
                if points[customPointIndex].pointType == CORE_POINT_TYPE:
                    points[index].pointType = customPointIndex

def mergeGroup(points, fromIndex, toIndex):
    for point in points:
        if point.group == fromIndex:
            point.group = toIndex

def dbscan(points, pointsIndexGroupWithinBoundary, clusterCenterNumber):
    countGroupsNumber = {index: 1 for index in range(len(points))}
    for index, point in enumerate(points):
        if point.pointType == CORE_POINT_TYPE:
            for customPointIndex in pointsIndexGroupWithinBoundary[index]:
                if points[customPointIndex].pointType == CORE_POINT_TYPE and points[customPointIndex].group != point.group:
                    countGroupsNumber[point.group] += countGroupsNumber[points[customPointIndex].group]
                    del countGroupsNumber[points[customPointIndex].group]
                    mergeGroup(points, points[customPointIndex].group, point.group)
        #point.pointType >= 0 means it is BOUNDARY_POINT_TYPE
        elif point.pointType >= 0:
            corePointGroupIndex = points[point.pointType].group
            countGroupsNumber[corePointGroupIndex] += countGroupsNumber[point.group]
            del countGroupsNumber[point.group]
            point.group = corePointGroupIndex
    countGroupsNumber = sorted(countGroupsNumber.items(), key=lambda group: group[1], reverse=True)
    count = 0
    for key, _ in countGroupsNumber:
        count += 1
        for point in points:
            if point.group == key:
                point.group = -1 * count
        if count >= clusterCenterNumber:
            break

def showClusterAnalysisResults(points):
    colorStore = ['or', 'og', 'ob', 'oc', 'om', 'oy', 'ok']
    pylab.figure(figsize=(9, 9), dpi = 80)
    for point in points:
        color = ''
        if point.group < 0:
            color = colorStore[-1 * point.group - 1]
        else:
            color = colorStore[-1]
        pylab.plot(point.x, point.y, color)
    pylab.show()

def main():
    clusterCenterNumber = 4
    pointsNumber = 500
    radius = 10
    Eps = 2
    minPointsNumber = 18
    points = generatePoints(pointsNumber, radius)
    pointsIndexGroupWithinBoundary = getPointsNumberWithinBoundary(points, Eps)
    decidePointsType(points, pointsIndexGroupWithinBoundary, minPointsNumber)
    dbscan(points, pointsIndexGroupWithinBoundary, clusterCenterNumber)
    showClusterAnalysisResults(points)

main()

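Afterword

A few open issues came up while studying and analyzing these algorithms:

  • First, all the clustering procedures above need the number of clusters to be specified by hand, and a procedure that depends on human intervention can be troublesome. One remedy is to try several candidate cluster counts {i, i+1, ..., k}, obtain a result for each, and then choose among them with Bayes' theorem. On the other hand, a machine cannot know the granularity and number of clusters a person wants, so leaving the decision entirely to the machine is not reasonable either.
  • Second, k-means is necessarily a complete clustering, and the choice of distance can also depend on the problem.
  • Third, agglomerative hierarchical clustering and density-based DBSCAN both contain a merging step. The best algorithm for this kind of merging is union-find, whose time complexity is O(n * f(n)), where f(n) is the inverse Ackermann function and is at most 4 for any n that occurs in practice. But chasing efficiency too hard would go against the strengths of Python for development and data analysis.
  • Fourth, agglomerative hierarchical clustering and density-based DBSCAN both constrain how far merging goes: agglomerative hierarchical clustering uses mergeRatio to set the proportion of merging, while DBSCAN uses Eps and minPointsNumber to set the granularity of the clustering.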
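For reference, the union-find structure mentioned in the third point could look like the sketch below. It is not used by the listings above; `DisjointSet` is a hypothetical class written for this illustration, with path compression and union by size.

```python
class DisjointSet:
    """Union-find over the integers 0 .. size-1."""

    def __init__(self, size):
        self.parent = list(range(size))  # every item starts as its own root
        self.size = [1] * size

    def find(self, item):
        # Walk up to the root of the item's set.
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node straight at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def union(self, a, b):
        """Merge the sets containing a and b; return False if already merged."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return False
        # Union by size: hang the smaller tree under the larger one.
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]
        return True

groups = DisjointSet(5)
groups.union(0, 1)
groups.union(3, 4)
groups.union(1, 3)
print(groups.find(0) == groups.find(4))  # True
print(groups.find(2) == groups.find(0))  # False
```

With this structure, merging two clusters no longer requires relabeling every point, which is what the `for point in points` loops in the agglomerative and DBSCAN listings do on each merge.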