聚类知识(Clustering)

Introduction

Cluster analysis is the process of grouping objects into subsets that have meaning in the context of a particular problem. The objects are thereby organized into an efficient representation that characterizes the population being sampled. Unlike classification, clustering does not rely on predefined classes. Clustering is referred to as an unsupervised learning method because no information is provided about the "right answer" for any of the objects. It can uncover previously undetected relationships in a complex data set. Many applications for cluster analysis exist. For example, in a business application, cluster analysis can be used to discover and characterize customer groups for marketing purposes.

数据挖掘交友

Two types of clustering algorithms are nonhierarchical and hierarchical. In nonhierarchical clustering, such as the k-means algorithm, the relationship between clusters is undetermined. Hierarchical clustering repeatedly links pairs of clusters until every data object is included in the hierarchy. With both of these approaches, an important issue is how to determine the similarity between two objects, so that clusters can be formed from objects with a high similarity to each other. Commonly, distance functions, such as the Manhattan and Euclidian distance functions, are used to determine similarity. A distance function yields a higher value for pairs of objects that are less similar to one another. Sometimes a similarity function is used instead, which yields higher values for pairs that are more similar.


Distance Functions
数据挖掘交友

Given two p-dimensional data objects i = (xi,xi, ...,xip) and j = (xj,xj, ...,xjp), the following common distance functions can be defined: 2121 数据挖掘研究院

Euclidian Distance Function:


Manhattan Distance Function:
When using the Euclidian distance function to compare distances, it is not necessary to calculate the square root because distances are always positive numbers and as such, for two distances, d1 and d2, Öd1 > Öd2 Û d1 > d2. If some of an object′s attributes are measured along different scales, so when using the Euclidian distance function, attributes with larger scales of measurement may overwhelm attributes measured on a smaller scale. To prevent this problem, the attribute values are often normalized to lie between 0 and 1.

Other distance functions may be more appropriate for some data.




k-means Algorithm

数据挖掘研究院

The k-means algorithm is one of a group of algorithms called partitioning methods. The problem of partitional clustering can be formally stated as follows: Given n objects in a d-dimensional metric space, determine a partition of the objects into k groups, or clusters, such that the objects in a cluster are more similar to each other than to objects in different clusters. Recall that a partition divides a set into disjoint parts that together include all members of the set. The value of k may or may not be specified and a clustering criterion, typically the squared-error criterion, must be adopted.

数据挖掘实验室

The solution to this problem is straightforward. Select a clustering criterion, then for each data object select the cluster that optimizes the criterion. The k-means algorithm initializes k clusters by arbitrarily selecting one object to represent each cluster. Each of the remaining objects are assigned to a cluster and the clustering criterion is used to calculate the cluster mean. These means are used as the new cluster points and each object is reassigned to the cluster that it is most similar to. This continues until there is no longer a change when the clusters are recalculated. The algorithm is shown in Figure 1.

数据挖掘交友

k-means Algorithm:
  1. Select k clusters arbitrarily.
  2. Initialize cluster centers with those k clusters.
  3. Do loop
      a) Partition by assigning or reassigning all data objects to their closest cluster center.
      b) Compute new cluster centers as mean value of the objects in each cluster.
    Until no change in cluster center calculation
Figure 1: k-means Algorithm



Figure 2: Sample Data
Example: Suppose we are given the data in Figure 2 as input and we choose k=2 and the Manhattan distance function dis = |x-x1|+|y-y1|. The detailed computation is as follows:

数据挖掘论坛



Step 1. Initialize k partitions. An initial partition can be formed by first specifying a set of k seed points. The seed points could be the first k objects or k objects chosen randomly from the input objects. We choose the first two objects as seed points and initialize the clusters as C1={(0,0)} and C2={(0,1)}.
Step 2. Since there is only one point in each cluster, that point is the cluster center.
Step 3a. Calculate the distance between each object and each cluster center, assigning the object to the closest cluster.
    For example, for the third object:
        dis(1,3) = |1-0| + |1-0| = 2 and
        dis(2,3) = |1-0| + |1-1| = 1,
    so this object is assigned to C2.
    The fifth object is equidistant from both clusters, so we arbitrarily assign it to C1. 数据挖掘论坛
    After calculating the distance for all points, the clusters contain the following objects:
        C1 = {(0,0),(1,0),(0.5,0.5)} and
        C2 = {(0,1),(1,1),(5,5),(5,6),(6,6),(6,5),(5.5,5.5)}.
Step 3b. Compute new cluster center for each cluster.
    New center for C1 = (0.5,0.16)
        (0+1+0.5)/3 = 0.5
        (0+0+0.5)/3 = 0.16
    New center for C2 = (4.1,4.2)
        (0+1+5+5+6+6+5.5)/7 = 4.1
        (1+1+5+5+6+6+5.5)/7 = 4.2
Step 3a′. New centers C1 = (0.5,0.16) and C2 = (4.1,4.2) differ from old centers C1={(0,0)} and C2={(0,1)}, so the loop is repeated.

数据挖掘交友


    Reassign the ten objects to the closest cluster center, resulting in:
         C1 = {(0,0),(0,1),(1,1),(1,0),(0.5,0.5)}
         C2 = {(5,5),(5,6),(6,6),(6,5),(5.5,5.5)}
Step 3b′. Compute new cluster center for each cluster
      New center for C1 = (0.5,0.5)
      New center for C2 = (5.5,5.5)
Step 3a′′. New centers C1 = (0.5,0.5) and C2 = (5.5,5.5) differ from old centers C1 = (0.5,0.16) and C2 = (4.1,4.2), so the loop is repeated.
    Reassign the ten objects to the closest cluster center. Result is same as in Step 5.
Step 3b′′. Compute new cluster centers.
Algorithm is done: Centers are the same as in Step 3b′, so algorithm is finished. Result is same as in 3b′.
2
2
数据挖掘论坛
Agglomerative Hierarchical Algorithm

Hierarchical algorithms can be either agglomerative or divisive, that is top-down or bottom-up. All agglomerative hierarchical clustering algorithms begin with each object as a separate group. These groups are successively combined based on similarity until there is only one group remaining or a specified termination condition is satisfied. For n objects, n-1 mergings are done. Hierarchical algorithms are rigid in that once a merge has been done, it cannot be undone. Although there are smaller computational costs with this, it can also cause problems if an erroneous merge is done. As such, merge points need to be chosen carefully. Here we describe a simple agglomerative clustering algorithm. More complex algorithms have been developed, such as BIRCH and CURE, in an attempt to improve the clustering quality of hierarchical algorithms. 数据挖掘实验室



Figure 3: Sample Dendogram

In the context of hierarchical clustering, the hierarchy graph is called a dendogram. Figure 3 shows a sample dendogram that could be produced from a hierarchical clustering algorithm. Unlike with the k-means algorithm, the number of clusters (k) is not specified in hierarchical clustering. After the hierarchy is built, the user can specify the number of clusters required, from 1 to n. The top level of the hierarchy represents one cluster, or k=1. To examine more clusters, we simply need to traverse down the hierarchy.

数据挖掘工具

Agglomerative Hierarchical Algorithm:

Given:
     A set X of objects {x1,...,xn}
     A distance function dis(c1,c2)
  1. for i = 1 to n
        ci = {xi}
    end for
  2. C = {c1,...,cb}
  3. l = n+1
  4. while C.size > 1 do
      a) (cmin,cmin) = minimum dis(ci,cj) for all ci,cj in C 数据挖掘交友
      b) remove cmin and cmin from C
      c) add {cmin,cmin} to C
      d) l = l + 1
    end while
    2
    1
    2
    1
    2
    1
Figure 4: Agglomerative Hierarchical Algorithm

Figure 4 shows a simple hierarchical algorithm. The distance function in this algorithm can determine similarity of clusters through many methods, including single link and group-average. Single link calculates the distance between two clusters as the shortest distance between any two objects contained in those clusters. Group-average first finds the average values for all objects in the group (i.e., cluster) and the calculates the distance between clusters as the distance between the average values. 数据挖掘论坛

Each object in X is initially used to create a cluster containing a single object. These clusters are successively merged into new clusters, which are added to the set of clusters, C. When a pair of clusters is merged, the original clusters are removed from C. Thus, the number of clusters in C decreases until there is only one cluster remaining, containing all the objects from X. The hierarchy of clusters is implicity represented in the nested sets of C.

数据挖掘实验室

Example: Suppose the input to the simple agglomerative algorithm described above is the set X, shown in Figure 5 represented in matrix and graph form. We will use the Manhattan distance function and the single link method for calculating distance between clusters. The set X contains n=10 elements, x1 to x10, where x1=(0,0).

Figure 5: Sample Data



Step 1. Initially, each element xi of X is placed in a cluster ci, where ci is a member of the set of clusters C.
    C = {{x1},{x2},{x3}, {x4},{x5},{x6},{x7}, {x8},{x9},{x10}} 数据挖掘工具

Step 2. Set l = 11.

Step 3. (First iteration of while loop) C.size = 10
  • The minimum single link distance between two clusters is 1. This occurs in two places, between c2 and c10 and between c3 and c10.
    Depending on how our minimum function works we can choose either pair of clusters. Arbitrarily we choose the first.
         (cmin,cmin) = (c2,c10) 21
  • Since l = 10, c11 = c2 U c10 = {{x2},{x10}}
  • Remove c2 and c10 from C.
  • Add c11 to C.
         C = {{x1},{x3}, {x4},{x5},{x6},{x7}, {x8},{x9},{{x2}, {x10}}}
  • Set l = l + 1 = 12
Step 3. (Second iteration) C.size = 9
  • The minimum single link distance between two clusters is 1. This occurs between, between c3 and c11 because the distance between x3 and x10 is 1, where x10 is in c11.
         (cmin,cmin) = (c3,c11) 21
  • c12 = c3 U c11 = {{{x2},{x10}},{x3}}
  • Remove c3 and c11 from C.
  • Add c12 to C.
         C = {{x1}, {x4},{x5},{x6},{x7}, {x8},{x9},{{{x2}, {x10}}, {x3}}}
  • Set l = 13
Step 3. (Third iteration) C.size = 8
  • (cmin,cmin) = (c1,c12) 21
  • C = {{x4},{x5},{x6},{x7}, {x8},{x9},{{{{x2}, {x10}}, {x3}},{x1}}}
Step 3. (Fourth iteration) C.size = 7
  • (cmin,cmin) = (c4,c8) 21
  • C = {{x5},{x6},{x7}, {x9},{{{{x2}, {x10}}, {x3}},{x1}},{{x4},{x8}}}
Step 3. (Fifth iteration) C.size = 6
  • (cmin,cmin) = (c5,c7) 21
  • C = {{x6}, {x9},{{{{x2}, {x10}}, {x3}},{x1}},{{x4},{x8}}, {{x5},{x7}}}
Step 3. (Sixth iteration) C.size = 5
  • (cmin,cmin) = (c9,c13) 21
  • C = {{x6}, {{x4},{x8}}, {{x5},{x7}},{{{{{x2}, {x10}}, {x3}},{x1}},{x9}}}
Step 3. (Seventh iteration) C.size = 4
  • (cmin,cmin) = (c6,c15) 21
  • C = {{{x4},{x8}}, {{{{{x2}, {x10}}, {x3}},{x1}},{x9}},{{x6}, {{x5},{x7}}}}
Step 3. (Eighth iteration) C.size = 3
  • (cmin,cmin) = (c14,c16) 21
  • C = { {{x6}, {{x5},{x7}}}, {{{x4},{x8}}, {{{{{x2}, {x10}}, {x3}},{x1}},{x9}}}}
Step 3. (Ninth iteration) C.size = 2
  • (cmin,cmin) = (c17,c18) 21
  • C = {{{{x4},{x8}}, {{{{{x2}, {x10}}, {x3}},{x1}},{x9}}}, {{x6}, {{x5},{x7}}}}
Step 3. (Tenth iteration) C.size = 1. Algorithm done. 数据挖掘交友

The cluster created from this algorithm can be seen in Figure 6. The corresponding dendogram formed from the hierarchy in C is shown in Figure 7. The points which appeared most closely together on the graph of input data in Figure 5 are grouped together more closely in the hierarchy.


Figure 6: Graph with Clusters


Figure 7: Sample Dendogram

资料全文下载

数据挖掘工具

[数据挖掘专家] [数据挖掘研究院] [数据挖掘论坛] [数据挖掘实验室]
上一篇:数据挖掘常用聚类算法比较分析
下一篇:数 据 挖 课??聚 集(二)
最新评论共有 0 位网友发表了评论 , 查看所有评论
发表评论( 不能超过250字,需审核,请自觉遵守互联网相关政策法规。 )
匿名?
数据挖掘网站导航 数据挖掘论坛导航
  • 数据挖掘工具
  • 数据挖掘论坛
  • DataCruncher - Cognos
  • MineSet - MathSoft
  • Intelligent Miner - GainSmarts
  • Sqlserver - SAS - Clementine
  • CART - Weka - WizSoft
  • NeuroShell - ModelQuest
  • data mining tools - Darwin
  • 数据挖掘交友
  • 数据挖掘博客
  • 数据挖掘工具
  • 数据挖掘资源
  • 数据挖掘技术算法
  • 数据挖掘相关期刊、会议
  • 研究院联盟合作专区
  • 数据挖掘基础与相关技术
  • 数据挖掘厂商与就业
  • 数据挖掘研究者乐园
  • 知名厂商数据挖掘工具资料
  • 国内数据挖掘实验室
  • Foreign Data Mining Lab
  • 热点关注
  • 模糊聚类分析
  • 聚类论文资源和源代码
  • Microsoft 聚类分析算法
  • 数据挖掘聚类算法一览
  • 层次聚类读书笔记
  • Microsoft 神经网络算法 (SSAS)
  • 聚类知识(Clustering)
  • 聚类技术优化
  • 可视化聚类读书笔记
  • 近邻测量( Measurement of Proximity)读
  • 论坛最新话题
  • Foundations of Statistical Natural Langu
  • Game Theory meet Data Mining: A Recent P
  • System Building: How does it help or hin
  • 数据挖掘与Clementine培训
  • 新手报到
  • 求 SASEM 客户流失预测分析
  • 数据挖掘工程师/搜索研究院—北京——无线
  • 数据挖掘入门介绍(如何着手数据挖掘)
  • Information Overload Survey Results
  • The INEX 2005 Workshop on Element Retrie
  • 相关资讯
  • 聚类分析工具
  • Google"s update to PageRank sparks
  • A Incremental Multi-Centroid, Multi-Run
  • Automatic Subspace Clustering of High Di
  • ROCK: A Robust Clustering Algorithm for
  • Data Clustering: A Review
  • 聚类技术优化
  • 近邻测量( Measurement of Proximity)读
  • 可视化聚类读书笔记
  • 层次聚类读书笔记
  • 数据挖掘实验室资料
  • 数据挖掘博客地址
  • 数据挖掘实验室网站地址
  • Prepare for Medicare audits by using dat
  • 注册成为SAS用户与爱好者俱乐部会员
  • 水南梅
  • 明日烟
  • 新人报道
  • 下载
  • 厦门服务器托管,450元/月—0592-5177319 高
  • 买空间送域名--0592-5177319 高静