Overview of Clustering Algorithms (k-Means++ / FCM / Agglomerative Hierarchical Clustering / DBSCAN)

Adapted from "A First Look at Clustering Algorithms: K-Means, Agglomerative Hierarchical Clustering, and DBSCAN" and "The Fuzzy Clustering FCM Algorithm".

The Goal of Clustering

Partition the data into clusters such that similarity within each cluster is high and similarity between clusters is low; the stronger this contrast, the better the clustering. Clustering is used to extract information and regularities from data.

Clustering Concepts

  • Hierarchical vs. partitional: when subclusters are allowed, the data is divided level by level and the result is a tree; the tree encodes the hierarchy of the partition. Subclusters at the same level do not overlap, and every element belongs to some subcluster at each level.
  • Exclusive, overlapping, and fuzzy: the core idea of fuzzy clustering is that no element belongs entirely to any single cluster; instead it belongs to every cluster with some degree of membership, and for any element the memberships generally sum to 1.
  • Complete vs. partial: complete clustering requires every element to be assigned to a cluster, while partial clustering allows noise points that belong to no cluster.

Types of Clusters

  • Well-separated: the distance between any two elements in different clusters is greater than the distance between any two elements within a cluster; such clusters are visibly separated in a plot.
  • Prototype-based: every element is closer to the center of its own cluster (the centroid of the cluster's elements) than to the center of any other cluster.
  • Graph-based: nodes are objects and edge weights are distances. This resembles the well-separated or prototype-based definitions, except that edge weights replace an explicitly specified distance.
  • Density-based: the most commonly used and most widely applicable of these definitions. The density of elements determines the shape of the clusters. Density-based clusters are the usual choice when noise is present and must be identified, or when cluster shapes are irregular.

Common Clustering Algorithms

  • Basic k-means: the k-means algorithm; its clusters are prototype-based. It is used when the number of clusters is known, requires clusters to be roughly circular, and cannot separate out noise.
  • Agglomerative hierarchical clustering: each point starts as its own cluster, and the closest clusters are merged repeatedly until the number of clusters meets the user's requirement.
  • DBSCAN: a clustering method based on density and partitioning.

Basic Ideas of the Algorithms

(1) Basic k-Means Clustering (hard c-means, HCM)

The method is simple. First choose a few initial cluster centers. Assign every element to the cluster whose center is nearest. Then recompute each cluster's center as the centroid of its elements. Repeat until the centroids no longer change appreciably; the clustering is then complete.

The choice of distance depends on the nature of the data or the requirements of the project. For the varieties of distance, see the Manhattan, diagonal, and Euclidean distances discussed in "Overview of the A* Algorithm and Its Applications in Game Development". In effect, the algorithm minimizes a global objective function: the sum of each element's distance to its nearest cluster center.
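As an illustration of the distance choices mentioned above, here is a minimal sketch on 2-D points given as tuples (the function names are my own; "diagonal" distance is taken to mean the Chebyshev/chessboard distance used in grid pathfinding):

```python
import math

# Three common distance measures; which to use depends on the data.
def manhattan(a, b):
    # grid distance: moves along axes only
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def chebyshev(a, b):
    # "diagonal"/chessboard distance: a diagonal step costs the same as a straight one
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

def euclidean(a, b):
    # straight-line distance
    return math.hypot(a[0] - b[0], a[1] - b[1])
```

For k-means specifically, squared Euclidean distance is the natural choice, since it makes the centroid the optimal prototype of a cluster.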

The algorithm has the following characteristics:

  • First, it does not necessarily find the global optimum; with a poor choice of initial centers it may reach only a local optimum. Researchers have therefore added a preprocessing step that picks initial centers satisfying certain conditions, which makes a good solution much more likely; the resulting method is called k-means++.
  • Second, it cannot eliminate the influence of noise points on the clustering.
  • Third, it requires cluster shapes to be roughly circular.
  • Fourth, it requires complete clustering: every point must be assigned.

k-Means++

Python code

The code below implements k-means++: the initial cluster centers are chosen by the seeding procedure described above, so that the subsequent iterations converge to a good solution.

import math
import random
import copy
import pylab

FLOAT_MAX = 1e100

class Point:
    __slots__ = ["x", "y", "group"]
    def __init__(self, x=0, y=0, group=0):
        self.x, self.y, self.group = x, y, group

def generatePoints(pointsNumber, radius):
    # scatter points with uniform random radius and angle inside a disc
    points = [Point() for _ in range(pointsNumber)]
    for point in points:
        r = random.random() * radius
        angle = random.random() * 2 * math.pi
        point.x = r * math.cos(angle)
        point.y = r * math.sin(angle)
    return points

def solveDistanceBetweenPoints(pointA, pointB):
    # squared Euclidean distance (the square root is not needed for comparisons)
    return (pointA.x - pointB.x) ** 2 + (pointA.y - pointB.y) ** 2

def getNearestCenter(point, clusterCenterGroup):
    minIndex = point.group
    minDistance = FLOAT_MAX
    for index, center in enumerate(clusterCenterGroup):
        distance = solveDistanceBetweenPoints(point, center)
        if distance < minDistance:
            minDistance = distance
            minIndex = index
    return minIndex, minDistance

def kMeansPlusPlus(points, clusterCenterGroup):
    # k-means++ seeding: the first center is random; each later center is
    # drawn with probability proportional to the squared distance from the
    # nearest center chosen so far
    clusterCenterGroup[0] = copy.copy(random.choice(points))
    distanceGroup = [0.0] * len(points)
    for index in range(1, len(clusterCenterGroup)):
        total = 0.0
        for i, point in enumerate(points):
            distanceGroup[i] = getNearestCenter(point, clusterCenterGroup[:index])[1]
            total += distanceGroup[i]
        total *= random.random()
        for i, distance in enumerate(distanceGroup):
            total -= distance
            if total < 0:
                clusterCenterGroup[index] = copy.copy(points[i])
                break
    for point in points:
        point.group = getNearestCenter(point, clusterCenterGroup)[0]

def kMeans(points, clusterCenterNumber):
    clusterCenterGroup = [Point() for _ in range(clusterCenterNumber)]
    kMeansPlusPlus(points, clusterCenterGroup)
    clusterCenterTrace = [[clusterCenter] for clusterCenter in clusterCenterGroup]
    tolerableError, currentError = 5.0, FLOAT_MAX
    while currentError >= tolerableError:
        countCenterNumber = [0] * clusterCenterNumber
        currentCenterGroup = [Point() for _ in range(clusterCenterNumber)]
        # recompute each center as the centroid of its members
        for point in points:
            currentCenterGroup[point.group].x += point.x
            currentCenterGroup[point.group].y += point.y
            countCenterNumber[point.group] += 1
        for index, center in enumerate(currentCenterGroup):
            center.x /= countCenterNumber[index]
            center.y /= countCenterNumber[index]
        # the error is the total squared movement of the centers
        currentError = 0.0
        for index, singleTrace in enumerate(clusterCenterTrace):
            singleTrace.append(currentCenterGroup[index])
            currentError += solveDistanceBetweenPoints(singleTrace[-1], singleTrace[-2])
            clusterCenterGroup[index] = copy.copy(currentCenterGroup[index])
        # reassign every point to its nearest center
        for point in points:
            point.group = getNearestCenter(point, clusterCenterGroup)[0]
    return clusterCenterGroup, clusterCenterTrace

def showClusterAnalysisResults(points, clusterCenterTrace):
    colorStore = ['or', 'og', 'ob', 'oc', 'om', 'oy', 'ok']
    pylab.figure(figsize=(9, 9), dpi=80)
    for point in points:
        if point.group >= len(colorStore):
            color = colorStore[-1]
        else:
            color = colorStore[point.group]
        pylab.plot(point.x, point.y, color)
    for singleTrace in clusterCenterTrace:
        pylab.plot([center.x for center in singleTrace], [center.y for center in singleTrace], 'k')
    pylab.show()

def main():
    clusterCenterNumber = 5
    pointsNumber = 2000
    radius = 10
    points = generatePoints(pointsNumber, radius)
    _, clusterCenterTrace = kMeans(points, clusterCenterNumber)
    showClusterAnalysisResults(points, clusterCenterTrace)

if __name__ == '__main__':
    main()

(1) Extra: Fuzzy c-Means Clustering (FCM)

Fuzzy c-means clustering, like hard k-means, is a partition-based clustering method; FCM is the natural successor of HCM. Unlike k-means, in FCM each point belongs to every cluster center v(i) with a different degree of membership u(i,j); otherwise the procedure resembles k-means. (See "The Fuzzy Clustering FCM Algorithm" for details.)

Clustering steps:

  • Initialization: choose the initial cluster centers with the k-means++ seeding method to improve the chance of converging to a good solution.
  • Compute each point's membership u(i,j) in every cluster center, where m is the weighting exponent:
  • u(i,j) = 1 / sum_k (distance(point(j), center(i)) / distance(point(j), center(k)))^(1/(m-1))
  • Compute the new cluster centers and record their trace:
  • v(i) = sum_j (u(i,j)^m * point(j)) / sum_j (u(i,j)^m)
  • Check whether the centers moved less than the given tolerance. If not, return to step 2; otherwise exit the loop.
  • Plot the cluster-center traces and the clustering result.
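The two update formulas in the steps above can be sketched in plain Python as follows. This is a minimal illustration assuming 2-D points as tuples and squared Euclidean distances; the function names are my own:

```python
def memberships(points, centers, m=2.0):
    # u[j][i]: degree to which point j belongs to center i; each row sums to 1
    u = []
    for p in points:
        d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
        if 0.0 in d:                       # point coincides with a center
            hit = d.index(0.0)
            u.append([1.0 if i == hit else 0.0 for i in range(len(centers))])
            continue
        # u_ij = 1 / sum_k (d_ij / d_kj)^(1/(m-1)), with d a squared distance
        u.append([1.0 / sum((d[i] / dk) ** (1.0 / (m - 1.0)) for dk in d)
                  for i in range(len(centers))])
    return u

def update_centers(points, u, m=2.0):
    # v_i = sum_j u_ij^m * x_j / sum_j u_ij^m: membership-weighted mean
    centers = []
    for i in range(len(u[0])):
        w = [u[j][i] ** m for j in range(len(points))]
        s = sum(w)
        centers.append((sum(wj * p[0] for wj, p in zip(w, points)) / s,
                        sum(wj * p[1] for wj, p in zip(w, points)) / s))
    return centers
```

One full FCM iteration is `update_centers(points, memberships(points, centers, m), m)`, repeated until the centers stop moving.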

FCM

Python code

import math
import random
import copy
import pylab

FLOAT_MAX = 1e100

class Point:
    __slots__ = ["x", "y", "group", "membership"]
    def __init__(self, clusterCenterNumber, x=0, y=0, group=0):
        self.x, self.y, self.group = x, y, group
        self.membership = [0.0] * clusterCenterNumber

def generatePoints(pointsNumber, radius, clusterCenterNumber):
    # first half of the points lie inside a disc; the second half is uniform
    # background noise over the bounding square
    points = [Point(clusterCenterNumber) for _ in range(2 * pointsNumber)]
    for point in points[:pointsNumber]:
        r = random.random() * radius
        angle = random.random() * 2 * math.pi
        point.x = r * math.cos(angle)
        point.y = r * math.sin(angle)
    for point in points[pointsNumber:]:
        point.x = 2 * radius * random.random() - radius
        point.y = 2 * radius * random.random() - radius
    return points

def solveDistanceBetweenPoints(pointA, pointB):
    # squared Euclidean distance
    return (pointA.x - pointB.x) ** 2 + (pointA.y - pointB.y) ** 2

def getNearestCenter(point, clusterCenterGroup):
    minIndex = point.group
    minDistance = FLOAT_MAX
    for index, center in enumerate(clusterCenterGroup):
        distance = solveDistanceBetweenPoints(point, center)
        if distance < minDistance:
            minDistance = distance
            minIndex = index
    return minIndex, minDistance

def kMeansPlusPlus(points, clusterCenterGroup):
    # k-means++ seeding, used here only to pick the initial FCM centers
    clusterCenterGroup[0] = copy.copy(random.choice(points))
    distanceGroup = [0.0] * len(points)
    for index in range(1, len(clusterCenterGroup)):
        total = 0.0
        for i, point in enumerate(points):
            distanceGroup[i] = getNearestCenter(point, clusterCenterGroup[:index])[1]
            total += distanceGroup[i]
        total *= random.random()
        for i, distance in enumerate(distanceGroup):
            total -= distance
            if total < 0:
                clusterCenterGroup[index] = copy.copy(points[i])
                break

def fuzzyCMeansClustering(points, clusterCenterNumber, weight):
    clusterCenterGroup = [Point(clusterCenterNumber) for _ in range(clusterCenterNumber)]
    kMeansPlusPlus(points, clusterCenterGroup)
    clusterCenterTrace = [[clusterCenter] for clusterCenter in clusterCenterGroup]
    tolerableError, currentError = 1.0, FLOAT_MAX
    while currentError >= tolerableError:
        for point in points:
            getSingleMembership(point, clusterCenterGroup, weight)
        currentCenterGroup = [Point(clusterCenterNumber) for _ in range(clusterCenterNumber)]
        # each new center is the membership-weighted mean of all points
        for centerIndex, center in enumerate(currentCenterGroup):
            upperSumX, upperSumY, lowerSum = 0.0, 0.0, 0.0
            for point in points:
                membershipWeight = pow(point.membership[centerIndex], weight)
                upperSumX += point.x * membershipWeight
                upperSumY += point.y * membershipWeight
                lowerSum += membershipWeight
            center.x = upperSumX / lowerSum
            center.y = upperSumY / lowerSum
        # update the cluster center trace and the stopping error
        currentError = 0.0
        for index, singleTrace in enumerate(clusterCenterTrace):
            singleTrace.append(currentCenterGroup[index])
            currentError += solveDistanceBetweenPoints(singleTrace[-1], singleTrace[-2])
            clusterCenterGroup[index] = copy.copy(currentCenterGroup[index])
    # finally, hard-assign each point to its highest-membership center
    for point in points:
        maxIndex, maxMembership = 0, 0.0
        for index, singleMembership in enumerate(point.membership):
            if singleMembership > maxMembership:
                maxMembership = singleMembership
                maxIndex = index
        point.group = maxIndex
    return clusterCenterGroup, clusterCenterTrace

def getSingleMembership(point, clusterCenterGroup, weight):
    distanceFromPoint2ClusterCenterGroup = [solveDistanceBetweenPoints(point, center) for center in clusterCenterGroup]
    for centerIndex in range(len(point.membership)):
        total = 0.0
        isCoincide = [False, 0]
        for index, distance in enumerate(distanceFromPoint2ClusterCenterGroup):
            if distance == 0:
                # the point coincides with a center: full membership there
                isCoincide = [True, index]
                break
            total += pow(distanceFromPoint2ClusterCenterGroup[centerIndex] / distance, 1.0 / (weight - 1.0))
        if isCoincide[0]:
            point.membership[centerIndex] = 1.0 if isCoincide[1] == centerIndex else 0.0
        else:
            point.membership[centerIndex] = 1.0 / total

def showClusterAnalysisResults(points, clusterCenterTrace):
    colorStore = ['or', 'og', 'ob', 'oc', 'om', 'oy', 'ok']
    pylab.figure(figsize=(9, 9), dpi=80)
    for point in points:
        if point.group >= len(colorStore):
            color = colorStore[-1]
        else:
            color = colorStore[point.group]
        pylab.plot(point.x, point.y, color)
    for singleTrace in clusterCenterTrace:
        pylab.plot([center.x for center in singleTrace], [center.y for center in singleTrace], 'k')
    pylab.show()

def main():
    clusterCenterNumber = 5
    pointsNumber = 2000
    radius = 10
    weight = 2
    points = generatePoints(pointsNumber, radius, clusterCenterNumber)
    _, clusterCenterTrace = fuzzyCMeansClustering(points, clusterCenterNumber, weight)
    showClusterAnalysisResults(points, clusterCenterTrace)

if __name__ == '__main__':
    main()

The algorithm has the following characteristics:

  • Its main properties are similar to those of ordinary k-means.
  • It requires complete clustering and cannot separate out noise points.
  • The resulting centers fit the data better, but the computation is relatively slower.
  • Membership acts as a smoothing parameter, so points are not hard-assigned to a single cluster center.

(2) Agglomerative Hierarchical Clustering

Initially every element is its own cluster; at each step the two clusters with the smallest inter-cluster distance are merged, until the number of clusters meets the requirement or the merged fraction exceeds a limit (e.g. 90%). The process resembles building a Huffman tree, or union-find. The inter-cluster distance can be defined in several ways: the minimum distance between elements of the two clusters, the maximum such distance, or the distance between cluster centroids.
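The three inter-cluster distance definitions just mentioned can be sketched as follows; a minimal illustration on clusters given as lists of 2-D tuples, using squared Euclidean distance as elsewhere in this article (the function names are my own):

```python
def _d(a, b):
    # squared Euclidean distance between two 2-D points
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def single_link(ca, cb):
    # minimum pairwise distance between the two clusters
    return min(_d(a, b) for a in ca for b in cb)

def complete_link(ca, cb):
    # maximum pairwise distance between the two clusters
    return max(_d(a, b) for a in ca for b in cb)

def centroid_link(ca, cb):
    # distance between the two cluster centroids
    ma = (sum(p[0] for p in ca) / len(ca), sum(p[1] for p in ca) / len(ca))
    mb = (sum(p[0] for p in cb) / len(cb), sum(p[1] for p in cb) / len(cb))
    return _d(ma, mb)
```

Single link tends to produce elongated, chain-like clusters, while complete link favors compact ones; the code below effectively uses single link, since it always merges across the globally closest pair of points.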

The algorithm has the following characteristics:

  • Agglomerative clustering uses more memory than the other methods discussed here.
  • It can exclude interference from noise points, though noise points may end up forming a cluster of their own.
  • It suits irregular cluster shapes and cases where complete clustering is not required.
  • A merge, once performed, cannot be undone.
  • Note that merging must be capped by a merge-limit ratio; otherwise excessive merging can pull all cluster centers together and the clustering fails.

Agglomerative hierarchical clustering

Python code

import math
import random
import pylab

class Point:
    __slots__ = ["x", "y", "group"]
    def __init__(self, x=0, y=0, group=0):
        self.x, self.y, self.group = x, y, group

def generatePoints(pointsNumber, radius):
    # four discs of points, one centered in each quadrant
    points = [Point() for _ in range(4 * pointsNumber)]
    originX = [-radius, -radius, radius, radius]
    originY = [-radius, radius, -radius, radius]
    count = 0
    countCenter = 0
    for index, point in enumerate(points):
        count += 1
        r = random.random() * radius
        angle = random.random() * 2 * math.pi
        point.x = r * math.cos(angle) + originX[countCenter]
        point.y = r * math.sin(angle) + originY[countCenter]
        point.group = index        # each point starts in its own cluster
        if count >= pointsNumber * (countCenter + 1):
            countCenter += 1
    return points

def solveDistanceBetweenPoints(pointA, pointB):
    # squared Euclidean distance
    return (pointA.x - pointB.x) ** 2 + (pointA.y - pointB.y) ** 2

def getDistanceMap(points):
    # all pairwise squared distances, sorted ascending; the key encodes the pair
    distanceMap = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            distanceMap[str(i) + '#' + str(j)] = solveDistanceBetweenPoints(points[i], points[j])
    return sorted(distanceMap.items(), key=lambda dist: dist[1])

def agglomerativeHierarchicalClustering(points, distanceMap, mergeRatio, clusterCenterNumber):
    unsortedGroup = {index: 1 for index in range(len(points))}
    # merge across the globally closest pair until only a fraction of the
    # clusters remains
    for key, _ in distanceMap:
        lowIndex, highIndex = int(key.split('#')[0]), int(key.split('#')[1])
        if points[lowIndex].group != points[highIndex].group:
            lowGroupIndex = points[lowIndex].group
            highGroupIndex = points[highIndex].group
            unsortedGroup[lowGroupIndex] += unsortedGroup[highGroupIndex]
            del unsortedGroup[highGroupIndex]
            for point in points:
                if point.group == highGroupIndex:
                    point.group = lowGroupIndex
        if len(unsortedGroup) <= int(len(points) * mergeRatio):
            break
    # keep the largest clusters, marked with negative group numbers
    sortedGroup = sorted(unsortedGroup.items(), key=lambda group: group[1], reverse=True)
    topClusterCenterCount = 0
    print(sortedGroup, len(sortedGroup))
    for key, _ in sortedGroup:
        topClusterCenterCount += 1
        for point in points:
            if point.group == key:
                point.group = -1 * topClusterCenterCount
        if topClusterCenterCount >= clusterCenterNumber:
            break
    return points

def showClusterAnalysisResults(points):
    colorStore = ['or', 'og', 'ob', 'oc', 'om', 'oy', 'ok']
    pylab.figure(figsize=(9, 9), dpi=80)
    for point in points:
        if point.group < 0:
            color = colorStore[-1 * point.group - 1]
        else:
            color = colorStore[-1]   # unmarked small clusters share one color
        pylab.plot(point.x, point.y, color)
    pylab.show()

def main():
    clusterCenterNumber = 4
    pointsNumber = 500
    radius = 10
    mergeRatio = 0.025
    points = generatePoints(pointsNumber, radius)
    distanceMap = getDistanceMap(points)
    points = agglomerativeHierarchicalClustering(points, distanceMap, mergeRatio, clusterCenterNumber)
    showClusterAnalysisResults(points)

if __name__ == '__main__':
    main()

(3) DBSCAN

DBSCAN is a density-based clustering algorithm, so the notion of density must be defined first. Here the density of a point is the number of points inside the square of side 2·Eps centered on it (a box neighborhood rather than the usual circular one). Points are then classified by density:

  • A point whose density reaches the threshold MinPts is a core point.
  • A point whose density is below MinPts but that has at least one core point in its neighborhood is a boundary point.
  • A point that is neither a core point nor a boundary point is a noise point.

The procedure is then:

  • Put all neighboring core points into the same cluster.
  • Put each boundary point into the cluster of a core point in its neighborhood.
  • Leave noise points unassigned.
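The classification of core, boundary, and noise points described above can be sketched as follows. This minimal illustration uses the same square (side 2·Eps) neighborhood as the full program in this section; the function name is my own:

```python
def classify(points, eps, min_pts):
    # points: list of 2-D tuples; returns {index: "core" | "border" | "noise"}
    def neighbors(i):
        cx, cy = points[i]
        return [j for j, (x, y) in enumerate(points)
                if j != i and abs(x - cx) <= eps and abs(y - cy) <= eps]

    nbrs = {i: neighbors(i) for i in range(len(points))}
    types = {}
    # core points first: density (neighbor count) reaches the threshold
    for i in range(len(points)):
        if len(nbrs[i]) >= min_pts:
            types[i] = "core"
    # remaining points are border if a core point lies in their neighborhood
    for i in range(len(points)):
        if i not in types:
            types[i] = ("border" if any(types.get(j) == "core" for j in nbrs[i])
                        else "noise")
    return types
```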

The algorithm has the following characteristics:

  • It can exclude interference from noise points.
  • It suits irregular cluster shapes and cases where complete clustering is not required.
  • A merge, once performed, cannot be undone.
  • Eps and minPointsNumber determine the granularity and extent of the clustering: increasing Eps or decreasing minPointsNumber makes the granularity coarser and the clusters larger. For a given problem, Eps and minPointsNumber must be tuned to meet the clustering requirements.
  • Density-based clustering avoids part of the distance computation, which can improve efficiency.

DBSCAN

Python code

import math
import random
import pylab

CORE_POINT_TYPE = -2
BOUNDARY_POINT_TYPE = 1  # any non-negative value marks a boundary point; the
                         # value is the index of a core point in its neighborhood
OTHER_POINT_TYPE = -1

class Point:
    __slots__ = ["x", "y", "group", "pointType"]
    def __init__(self, x=0, y=0, group=0, pointType=OTHER_POINT_TYPE):
        self.x, self.y, self.group, self.pointType = x, y, group, pointType

def generatePoints(pointsNumber, radius):
    # four discs of points, one centered in each quadrant
    points = [Point() for _ in range(4 * pointsNumber)]
    originX = [-radius, -radius, radius, radius]
    originY = [-radius, radius, -radius, radius]
    count = 0
    countCenter = 0
    for index, point in enumerate(points):
        count += 1
        r = random.random() * radius
        angle = random.random() * 2 * math.pi
        point.x = r * math.cos(angle) + originX[countCenter]
        point.y = r * math.sin(angle) + originY[countCenter]
        point.group = index
        if count >= pointsNumber * (countCenter + 1):
            countCenter += 1
    return points

def isInPointBoundary(centerPoint, customPoint, halfScale):
    # square neighborhood of side 2 * halfScale centered on centerPoint
    return (abs(customPoint.x - centerPoint.x) <= halfScale and
            abs(customPoint.y - centerPoint.y) <= halfScale)

def getPointsNumberWithinBoundary(points, halfScale):
    # for every point, collect the indices of the points in its neighborhood
    pointsIndexGroupWithinBoundary = [[] for _ in range(len(points))]
    for centerIndex, centerPoint in enumerate(points):
        for index, customPoint in enumerate(points):
            if centerIndex != index and isInPointBoundary(centerPoint, customPoint, halfScale):
                pointsIndexGroupWithinBoundary[centerIndex].append(index)
    return pointsIndexGroupWithinBoundary

def decidePointsType(points, pointsIndexGroupWithinBoundary, minPointsNumber):
    # mark core points first, then boundary points (those with a core neighbor)
    for index, customPointsGroup in enumerate(pointsIndexGroupWithinBoundary):
        if len(customPointsGroup) >= minPointsNumber:
            points[index].pointType = CORE_POINT_TYPE
    for index, customPointsGroup in enumerate(pointsIndexGroupWithinBoundary):
        if len(customPointsGroup) < minPointsNumber:
            for customPointIndex in customPointsGroup:
                if points[customPointIndex].pointType == CORE_POINT_TYPE:
                    points[index].pointType = customPointIndex

def mergeGroup(points, fromIndex, toIndex):
    for point in points:
        if point.group == fromIndex:
            point.group = toIndex

def dbscan(points, pointsIndexGroupWithinBoundary, clusterCenterNumber):
    countGroupsNumber = {index: 1 for index in range(len(points))}
    for index, point in enumerate(points):
        if point.pointType == CORE_POINT_TYPE:
            # merge neighboring core points into one cluster
            for customPointIndex in pointsIndexGroupWithinBoundary[index]:
                if points[customPointIndex].pointType == CORE_POINT_TYPE and points[customPointIndex].group != point.group:
                    countGroupsNumber[point.group] += countGroupsNumber[points[customPointIndex].group]
                    del countGroupsNumber[points[customPointIndex].group]
                    mergeGroup(points, points[customPointIndex].group, point.group)
        elif point.pointType >= 0:
            # boundary point: join the cluster of the core point it recorded
            corePointGroupIndex = points[point.pointType].group
            countGroupsNumber[corePointGroupIndex] += countGroupsNumber[point.group]
            del countGroupsNumber[point.group]
            point.group = corePointGroupIndex
    # mark the largest clusters with negative group numbers for plotting
    countGroupsNumber = sorted(countGroupsNumber.items(), key=lambda group: group[1], reverse=True)
    count = 0
    for key, _ in countGroupsNumber:
        count += 1
        for point in points:
            if point.group == key:
                point.group = -1 * count
        if count >= clusterCenterNumber:
            break

def showClusterAnalysisResults(points):
    colorStore = ['or', 'og', 'ob', 'oc', 'om', 'oy', 'ok']
    pylab.figure(figsize=(9, 9), dpi=80)
    for point in points:
        if point.group < 0:
            color = colorStore[-1 * point.group - 1]
        else:
            color = colorStore[-1]   # noise and leftover points share one color
        pylab.plot(point.x, point.y, color)
    pylab.show()

def main():
    clusterCenterNumber = 4
    pointsNumber = 500
    radius = 10
    Eps = 2
    minPointsNumber = 18
    points = generatePoints(pointsNumber, radius)
    pointsIndexGroupWithinBoundary = getPointsNumberWithinBoundary(points, Eps)
    decidePointsType(points, pointsIndexGroupWithinBoundary, minPointsNumber)
    dbscan(points, pointsIndexGroupWithinBoundary, clusterCenterNumber)
    showClusterAnalysisResults(points)

if __name__ == '__main__':
    main()

Afterword

While studying and analyzing these methods, I found several open issues:

  • First, all the clustering procedures above need the number of cluster centers specified by hand, and needing human intervention in clustering can be troublesome. One remedy is to try several candidate cluster counts {i, i+1, ..., k}, obtain an analysis result for each, and choose among them with Bayesian reasoning. On the other hand, a machine cannot know what granularity or number of clusters a person wants, so leaving the choice entirely to the machine is also unreasonable.
  • Second, k-means must cluster completely, and the distance measure can be chosen to suit the problem.
  • Third, both agglomerative hierarchical clustering and density-based DBSCAN contain a merging step, and the best algorithm for such merging is union-find, whose running time is O(n * f(n)), where f(n) < 4 for any n of practical size. But chasing efficiency too hard would forfeit the advantages of developing and analyzing data in Python.
  • Fourth, both agglomerative hierarchical clustering and DBSCAN constrain how far merging may go: agglomerative clustering fixes the merge proportion with mergeRatio, while DBSCAN sets the clustering granularity through Eps and minPointsNumber.
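As a concrete note on the third point, the repeated group-relabeling in both merge loops could be replaced by a union-find (disjoint-set) structure; with path compression and union by size, each operation is nearly constant time. A minimal sketch, not taken from the article's code:

```python
class DisjointSet:
    # union-find with path compression and union by size
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, i):
        # walk to the root, halving the path as we go
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def union(self, a, b):
        # merge the sets containing a and b; returns False if already merged
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True
```

With this structure, a cluster merge is one `union` call instead of a pass over all points, and `size[find(i)]` replaces the per-group counters kept in the dictionaries above.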
